Pre-K Classroom-Economic Composition and Children’s Early

Academic  Development 


Portia  Miller  and  Elizabeth  Votruba-Drzal 

University  of  Pittsburgh 


Meghan  McQuiggan 

American  Institutes  for  Research,  Washington,  DC 


Alyssa  Shaw 

National  Abortion  Federation,  Washington,  DC 

There  are  currently  2  principal  models  of  publicly  funded  prekindergarten  programs  (pre-K):  targeted 
pre-K,  which  is  means-tested,  and  universal  pre-K.  These  programs  often  differ  in  terms  of  the  economic 
characteristics  of  the  preschoolers  enrolled.  Studies  have  documented  links  between  individual  achieve¬ 
ment  in  school-age  children  and  the  economic  composition  of  classroom  peers,  but  little  research  has 
revealed  whether  these  associations  hold  in  pre-K  classrooms.  Using  data  from  2,966  children  in  709 
pre-K  classrooms,  we  examined  whether  classroom-economic  composition  (i.e.,  average  family  income, 
standard  deviation  of  incomes,  and  percentage  of  students  from  low-income  households)  relates  to 
achievement  in  preschool.  Furthermore,  this  study  investigated  whether  associations  between  classroom- 
economic  composition  and  achievement  differed  depending  on  initial  academic  skill  level.  Increased 
economic advantage in pre-K classrooms positively predicted spring achievement. Specifically, increasing
aggregate classroom income between $22,500 and $62,500 was related to improvements in math
scores. Increases in the proportion of children from low-income households in the classroom were
negatively  related  to  both  math  and  literacy  and  language  skills  when  increases  occurred  between  52.5% 
and  72.5%  and  25%  and  45%,  respectively.  There  was  limited  evidence  that  links  between  classroom- 
economic  composition  and  achievement  differed  depending  on  initial  skill  level.  Results  suggest  that 
economically  integrated  pre-K  programs  may  be  more  beneficial  to  preschoolers  from  low-income 
households’  achievement  than  classrooms  targeting  economically  disadvantaged  children. 

Keywords:  prekindergarten,  academic  achievement,  classroom-economic  composition 


Not  all  children  begin  kindergarten  on  equal  footing  academi¬ 
cally.  Disparities  in  children’s  academic  skills  at  kindergarten 
entry  related  to  family  socioeconomic  status  (SES)  are  well  doc¬ 
umented  (e.g.,  Duncan  &  Magnuson,  2011;  Garcia,  2015).  Increas¬ 
ingly,  prekindergarten  (pre-K) — defined  here  as  publicly  funded, 
center-based  preschool  programs  attended  1-2  years  before  kin¬ 
dergarten— has  been  heralded  as  a  policy  lever  to  narrow  early 
socioeconomic  gaps  in  achievement  (Magnuson  &  Waldfogel, 
2005).  Indeed,  mounting  evidence  shows  positive  impacts  of  pre-K 
on  early  academic  skills  (Gormley,  Gayer,  Phillips,  &  Dawson, 
2005;  Henry  et  al.,  2003;  Magnuson,  Meyers,  Ruhm,  &  Waldfogel, 
2004; Reynolds & Temple, 1995; Weiland & Yoshikawa, 2013).
Studies  have  found  overwhelmingly  that  the  benefits  of  pre-K  are 
stronger  for  socioeconomically  disadvantaged  children  than  their 
more  advantaged  peers  (Magnuson  et  al.,  2004;  Magnuson,  Ruhm, 
& Waldfogel, 2007; Votruba-Drzal, Coley, Koury, & Miller,
2013).

As  government  budgets  become  more  constrained  and  demand 
for  pre-K  programs  grows,  a  key  question  for  early-childhood 
education  researchers  and  policymakers  is  whether  it  is  more 
effective  to  offer  targeted  pre-K  programs  that  limit  enrollment 
solely  to  economically  disadvantaged  children  or  to  implement 
universal  pre-K,  which  is  available  to  all  children  regardless  of 
family  income.  Head  Start,  for  example,  is  the  only  federally 
funded  preschool  program  in  the  United  States  and  is  targeted, 
serving  nearly  one  million  children  from  low-income  households 
(Office  of  Head  Start,  2013).  In  addition,  40  states  fund  pre-K 
programs,  and  although  most  of  these  programs  serve  disadvan¬ 
taged  children,  who  are  most  often  defined  as  children  with  family 
incomes  below  185%-200%  of  the  federal  poverty  level  (FPL), 
four  states  and  the  District  of  Columbia  have  adopted  universal 
pre-K programs (Barnett, Carolan, Squires, Brown, & Horowitz,
2014). These varied approaches to pre-K may lead to differences
in  the  concentration  of  children  from  low-income  households  in 
preschool  classes,  thereby  raising  important  questions  regarding 
the role of classroom-economic composition, defined as the collective economic
characteristics (e.g., family income, poverty status) of students
in a classroom, in shaping the development of children’s
school  readiness  skills.  Although  many  studies  have  documented 
positive  relations  between  increased  classroom  socioeconomic  ad¬ 
vantage  and  individual  children’s  achievement  in  elementary  and 
secondary  school  (see  van  Ewijk  &  Sleegers,  2010),  scant  research 
has  examined  whether  these  associations  hold  in  preschool  class¬ 
rooms  and  for  whom  they  are  strongest. 

Analyzing  data  from  2,966  English-  and/or  Spanish-speaking 
children in 709 pre-K classrooms across 11 states, this study
examined  whether  aggregate  pre-K  classroom-economic  composi¬ 
tion,  including  average  family  income,  variability  of  classroom 
incomes,  and  proportion  of  children  in  the  class  that  are  low- 
income,  is  related  to  individual  academic  achievement  across  the 
pre-K  year.  Moreover,  we  considered  whether  associations  be¬ 
tween  classroom-economic  composition  and  achievement  are  lin¬ 
ear  or  whether  they  vary  across  levels  of  aggregate  economic 
advantage.  Finally,  we  investigated  whether  links  between 
classroom-economic  composition  and  individual  achievement  dif¬ 
fer  for  children  based  on  their  academic  skills  at  the  start  of  pre-K. 
This  study  seeks  to  inform  programs  and  policies  aimed  at  enhanc¬ 
ing  early  education  and  school  readiness  by  elucidating  how  the 
economic  characteristics  of  children  in  a  preschool  classroom 
relate  to  individual  children’s  learning. 

Theoretical  Framework  for  Peer-Economic 
Compositional  Effects  in  Pre-K 

Children’s  academic  development  during  pre-K  is  driven  by 
proximal  processes  and  interactions  in  the  classroom  context 
(Creemers  &  Reezigt,  1996).  Children  both  affect  and  are  influ¬ 
enced  by  the  classroom  learning  environment.  Proximal  processes 
and  interactions  within  the  classroom  are  also  shaped  by  the  collective 
characteristics  of  students  within  it,  including  classroom-economic 
composition.  When  thinking  about  pre-K  classrooms  in  particular, 
classroom-economic  composition  likely  affects  children’s  learning  in 
three  ways:  by  influencing  (a)  peer  interactions,  (b)  teachers’  instruc¬ 
tion  practices  and  interactions  with  children,  and  (c)  the  classroom 
structure. 

First,  children  may  be  directly  impacted  by  peers  in  their  class¬ 
rooms  through  day-to-day  interactions  that  provide  opportunities 
for  modeling  and  reinforcement  of  behaviors  and  skills  (e.g.,  Bub, 
McCartney,  &  Willet,  2007;  Raver  et  al.,  2011;  Schechter  &  Bye, 
2007).  Children  from  low-income  homes  tend  to  possess  fewer 
academic  skills  and  exhibit  more  behavioral  problems  than  their 
more  advantaged  peers  (Magnuson  &  Votruba-Drzal,  2009).  Thus, 
economically  integrated  classrooms  may  expose  children  from 
low-income  households  to  peers  that  model  more  advanced  aca¬ 
demic  skills  and  adaptive  approaches  to  learning  and  classroom 
behavior.  Conversely,  in  more  disadvantaged  pre-K  classrooms, 
larger  numbers  of  disruptive  and  aggressive  peers  may  model 
maladaptive behavior that may be reinforced by classmates (Battistich
et al., 1995; Dishion, Spracklen, Andrews, & Patterson,
1996),  which  could  interfere  with  learning  by  negatively  impacting 
children’s  attention  and  classroom  behavior  (Georges,  Brooks- 
Gunn,  &  Malone,  2012;  Hinshaw,  1992;  Neidell  &  Waldfogel, 
2010). 

Second,  differences  in  classroom-economic  composition  may 
elicit different instruction from teachers. Disadvantaged children
tend to receive less instructive and evaluative feedback and engage
in  fewer  responsive  and  positive  interactions  with  teachers  (e.g., 
Arnold,  1997;  Connor,  Son,  Hindman,  &  Morrison,  2005).  Class¬ 
rooms  with  high  levels  of  disadvantage  are  often  characterized  by 
less  constructivist,  student-centered  instruction  (Stipek,  2004), 
which  may  inhibit  achievement.  In  addition,  if  greater  numbers  of 
students  in  highly  disadvantaged  pre-Ks  are  struggling  with  stress 
at  home,  learning  delays,  or  attention  and  behavior  problems, 
teachers  may  spend  more  time  tending  to  these  needs  instead  of 
focusing  on  instruction  (e.g.,  Carr,  Taylor,  &  Robinson,  1991; 
Dreeben  &  Barr,  1988;  Gamoran,  1986;  Lavy,  Paserman,  & 
Schlosser, 2012; Pallas, Entwisle, Alexander, & Stluka, 1994).
Indeed,  the  average  academic  skills  of  children  in  a  class  predict 
teachers’  pace  and  level  of  instruction,  as  well  as  their  academic 
expectations  for  students  (Dreeben  &  Barr,  1988;  Pallas  et  al., 
1994),  and  increases  in  aggregate  behavior  problems  tend  to  re¬ 
duce  instructional  time  (Carr  et  al.,  1991).  Thus,  we  may  expect 
that,  in  pre-K  classrooms  with  higher  average  incomes  and  lower 
proportions  of  children  from  low-income  households,  students 
may  have  more  learning  opportunities  because  teachers  spend 
more  time  teaching,  engage  in  more  complex,  student-centered 
instruction,  and  have  more  positive  interactions  with  children. 

Finally,  the  structure  of  pre-K  classrooms  may  be  impacted  by 
classroom-economic  composition.  Increased  classroom  economic 
advantage  predicts  more  time  in  free  play,  less  time  spent  in 
routine  activities  like  getting  in  line,  cleaning  up  after  meals, 
transitioning  between  activities,  and  more  time  spent  in  learning 
activities  (Early  et  al.,  2010).  Thus,  in  less  economically  advan¬ 
taged  classrooms,  children’s  academic  skills  may  grow  more 
slowly  if  they  spend  a  smaller  amount  of  time  in  learning  activities 
and  free  play,  both  of  which  have  been  tied  to  early  language, 
literacy,  and  math  achievement  (e.g.,  Cabell,  DeCoster,  LoCasale- 
Crouch,  Hamre,  &  Pianta,  2013;  Connor  et  al.,  2005;  Ginsburg, 
Lee,  &  Boyd,  2008). 

It  is  critical  to  note  that  not  all  theories  of  peer  effects  posit 
benefits  of  classroom  integration.  Some  models  contend  that  being 
instructed  with  peers  of  similar  SES,  and  presumably  abilities, 
produces  optimal  achievement  outcomes  because  teachers  can 
direct  their  instruction,  lesson  plans,  and  materials  to  the  predom¬ 
inant  skill  level  (e.g.,  Hoxby  &  Weingarth,  2005).  Similarly, 
theories  of  relative  disadvantage,  sometimes  referred  to  as  the 
“frog pond” perspective, highlight the negative consequences of
socioeconomically  integrated  classrooms  because  children  may  be 
evaluated  by  their  relative  standing  in  the  classroom,  that  is,  being 
the  “small  frog  in  the  pond”  (Crosnoe,  2009;  Marsh  &  Hau,  2003). 
It  suggests  that  poor  students  may  face  greater  competition  for 
grades  and  more  risks  for  stigmatization  in  economically  inte¬ 
grated  classes  than  in  classrooms  with  similarly  situated  peers. 
Thus,  a  measure  of  the  standard  deviation  or  spread  among  stu¬ 
dents’  economic  backgrounds  is  an  important  compositional  factor 
that  may  affect  individual  achievement. 


Classroom-Economic  Composition  and 
Individual  Achievement 

Relations  between  peer-compositional  characteristics  and  indi¬ 
vidual  achievement  were  noted  as  early  as  1966  in  the  Equality  of 
Educational  Opportunity  Study,  which  is  commonly  referred  to  as 
“the Coleman report.” Findings suggested that aggregate socioeconomic
characteristics at the school level were more important in
explaining  individual  student  achievement  than  factors  like  school 
facilities,  curriculum,  or  teaching  quality  (Coleman  et  al.,  1966). 
Since  then,  an  abundance  of  literature  has  documented  associations 
between  economic  characteristics  at  the  classroom-  or  school-level 
and  individual  students’  achievement  (see  van  Ewijk  &  Sleegers, 
2010;  Teddlie,  Stringfield,  &  Reynolds,  2000  for  reviews).  A 
meta-analysis  of  30  studies  concluded  that  the  average  family 
income of peers has a moderate (d ≈ 0.25 SD) positive association
with  individual  achievement  (van  Ewijk  &  Sleegers,  2010). 

Researchers  attempting  to  account  for  omitted-variable  bias 
(i.e.,  bias  occurring  when  estimates  do  not  control  for  important 
causal  factors  that  are  correlated  with  economic  characteristics  of 
the  classroom  and  individual  student  achievement)  obtained  effect 
sizes  that  were  smaller  than  studies  making  no  such  efforts  (e.g., 
Hutchison,  2003;  Rivkin,  2001;  Strand,  1998).  This  is  an  important 
consideration  given  that  children  are  not  randomly  assigned  to 
classrooms  or  schools,  and  factors  that  lead  children  to  a  particular 
classroom  or  school  may  also  be  related  to  children’s  SES  and 
development.  Also,  studies  using  different  measures  of  classroom- 
economic  composition  often  have  conflicting  results.  This  is  es¬ 
pecially  evident  when  examining  percent  of  the  class/school  that  is 
low  income  as  a  measure  of  classroom-economic  composition, 
which  is  most  often  operationalized  as  percent  eligible  for  free-  or 
reduced-price  lunch  (FRL).  This  measure  is  salient  to  the  universal 
versus  targeted  pre-K  debate  because  FRL  eligibility  is  often  used 
as  a  criterion  for  enrollment.  Some  studies  have  found  negative 
associations  between  increased  numbers  of  low-income  peers  and 
individual  achievement  (e.g.,  Cooley,  2010),  but  others  uncover  no 
links  (e.g.,  Bankston  &  Caldas,  1998;  Hanushek,  Kain,  Markman, 
&  Rivkin,  2003).  Greater  inconsistencies  and  null  effects  in  studies 
using  school  reports  of  FRL  eligibility  instead  of  direct  reports  of 
family  resources  are  not  surprising,  given  that  it  is  an  error-prone 
measure  of  economic  disadvantage  (Harwell  &  LeBeau,  2010).  A 
goal  of  the  current  study  was  to  examine  links  between  classroom- 
economic  composition  and  individual  achievement  using  a  mea¬ 
sure  derived  from  parental  reports  of  household  income. 

It  is  noteworthy  that  studies  linking  classroom-economic  com¬ 
position  to  individual  achievement  have  generally  used  school- 
aged  samples.  Some  of  the  hypothesized  mechanisms  for  peer 
effects,  such  as  academic  tracking,  competition  for  grades,  and 
self-evaluation  based  on  one’s  standing  relative  to  peers  (e.g., 
Crosnoe,  2009),  are  less  relevant  in  the  preschool  years.  Accord¬ 
ingly,  it  is  not  clear  that  prior  findings  generalize  to  pre-K  class¬ 
rooms.  The  lone  exception,  a  study  using  data  from  the  National 
Center  for  Early  Development  and  Learning  (NCEDL;  Reid  & 
Ready,  2013),  confounded  income  and  maternal  education,  which 
limited  its  ability  to  inform  pre-K  policy  because  program  eligi¬ 
bility  is  generally  based  on  income,  not  education.  For  the  present 
study,  we  controlled  for  other  child,  family,  and  classroom  char¬ 
acteristics,  including  average  maternal  education  level,  to  explore 
the  unique  role  of  classroom-economic  composition  in  predicting 
achievement. 

Nonlinearities  in  Relations  Between 
Classroom-Economic  Composition  and  Achievement 

Existing  research  on  classroom-economic  composition  and 
achievement is based on the assumption that the association is
linear, which suggests that an increase in mean classroom income
or  the  proportion  of  children  from  low-income  households  is 
related  to  the  same  achievement  growth,  regardless  of  the  level  of 
classroom  economic  characteristics.  This  presumes  that  moving 
from  a  classroom  with  a  mean  income  level  of  $20,000  to  a 
classroom  with  a  mean  income  level  of  $30,000  predicts  the  same 
gains  in  achievement  as  moving  from  a  classroom  with  mean 
income  of  $80,000  to  one  with  $90,000.  This  assumption  has  not 
been  empirically  tested  and  is  contrary  to  research  on  links  be¬ 
tween  individual  income  and  achievement  that  has  found  that 
increased  income  predicts  larger  achievement  gains  for  children 
from  low-income  families  (e.g.,  Duncan,  Ziol-Guest,  &  Kalil, 
2010). 
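
The linearity assumption can be written out explicitly. The notation below is ours and purely illustrative: $A_{ic}$ denotes the achievement of child $i$ in classroom $c$, and $\bar{I}_c$ the mean classroom income in units of $10,000. It is a sketch of the contrast being drawn, not a specification taken from any of the cited studies.

```latex
% Linear specification: the marginal association is the same everywhere.
A_{ic} = \beta_0 + \beta_1 \bar{I}_c + \varepsilon_{ic}
\quad\Longrightarrow\quad
\frac{\partial A_{ic}}{\partial \bar{I}_c} = \beta_1
\quad \text{for every } \bar{I}_c .

% Nonlinear alternative: \beta_1 \bar{I}_c is replaced by an unspecified
% smooth function f, so the marginal association may vary with income.
A_{ic} = \beta_0 + f(\bar{I}_c) + \varepsilon_{ic},
\qquad
\frac{\partial A_{ic}}{\partial \bar{I}_c} = f'(\bar{I}_c) .
```

Under the linear form, a move from $\bar{I}_c = 2$ to $3$ and a move from $\bar{I}_c = 8$ to $9$ both predict a change of $\beta_1$. Under the nonlinear form, the two moves may predict different changes, which is the possibility raised here.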

Furthermore,  evidence  from  the  broader  peer-effects  literature 
suggests  that  nonlinear  associations  should  be  considered  (e.g., 
Hoxby  &  Weingarth,  2005;  Lazear,  2001;  Neidell  &  Waldfogel, 
2010).  For  example,  studies  have  uncovered  evidence  of  a  “tipping 
point”  of  peer  effects,  whereby  the  positive  effects  of  increasing 
numbers  of  more  advantaged  or  skilled  students  are  not  evident 
until  the  proportion  of  skilled  students  in  the  class/school  reaches 
a  certain  point  (e.g.,  Lazear,  2001;  Neidell  &  Waldfogel,  2010), 
though  at  least  one  study  observes  the  opposite  pattern — positive 
effects  of  high-achieving  peers  diminish  as  classes  get  more  skilled 
(Zimmer  &  Toma,  2000).  A  novelty  of  the  present  study  is  its  use 
of  nonparametric  modeling  techniques  to  examine  nonlinear  links 
between  classroom-economic  composition  and  achievement. 

Differences  in  Associations  by  Individual  Ability 

Research  on  economic  composition  and  individual  achievement 
has  not  included  exploration  of  whether  these  relations  depend  on 
individual  characteristics,  despite  evidence  of  such  moderation  in 
the  broader  peer-effects  literature.  For  instance,  several  studies 
have  shown  that  links  between  peers’  academic  abilities  and 
achievement  are  strongest  for  the  lowest  performing  students 
(Burke & Sass, 2013; Justice, Petscher, Schatschneider, & Mashburn,
2011; Zimmer & Toma, 2000). However, findings are mixed,
with  some  studies  failing  to  find  moderation  by  initial  skill  level 
(Hanushek  et  al.,  2003)  and  others  showing  harmful  associations  of 
lower  skilled  peers  and  achievement  for  students  with  more  advanced 
academic skills (Imberman, Kugler, & Sacerdote, 2012; see also
Mashburn, Justice, Downer, & Pianta, 2009). These studies were
investigations  of  peer-academic  ability  as  the  compositional  predictor. 
However,  they  guided  our  examination  of  moderation  by  individual 
ability  in  the  present  study  of  pre-K-classroom-economic  composi¬ 
tion’s  role  in  predicting  achievement. 

Research  Aims 

This  study  had  three  research  aims,  the  first  of  which  was  to 
examine  whether  classroom-economic  composition,  as  measured 
by  average  family  income,  standard  deviation  of  classroom  in¬ 
comes,  and  percentage  of  students  from  low-income  households, 
was  associated  with  student  achievement  in  pre-K.  Particular  at¬ 
tention  was  given  to  examining  whether  these  links  are  nonlinear. 
The  second  aim  was  to  estimate  the  size  of  associations  between 
classroom-economic  composition  and  academic  skills  across  class¬ 
rooms  of  varied  economic  composition.  Third  was  to  consider 
whether  relations  between  classroom-economic  composition  and 
achievement differed based on children’s initial achievement.
Based  on  prior  research,  we  hypothesized  that  classroom  advan¬ 
tage  (i.e.,  higher  aggregate-mean  income  and  lower  percentages  of 
students  from  low-income  households)  would  positively  predict 
individual  achievement,  but  we  did  not  expect  significant  links 
between  the  standard  deviation  of  classroom  incomes  and  individ¬ 
ual  achievement.  We  also  hypothesized  threshold  effects  of 
classroom-economic  composition,  whereby  classroom  advantage 
would  be  more  strongly  associated  with  achievement  in  the  most 
disadvantaged  classrooms.  Last,  we  predicted  that  the  benefits  of 
classroom  advantage  would  be  the  most  pronounced  for  children 
with  less  advanced  academic  skills  at  the  start  of  pre-K. 

Method 

Participants 

Data  for  this  study  were  drawn  from  two  prospective  evalua¬ 
tions  conducted  by  the  NCEDL  of  state-funded  pre-K  programs: 
the  Multi-State  Study  of  Pre-Kindergarten  (Multi-State)  and  the 
State-Wide  Early  Education  Programs  Study  (SWEEP;  Early  et  al., 
2005).  The  Multi-State  and  the  SWEEP  included  programs  from 
11 states¹ that had traditionally high rates of pre-K enrollment and
were  diverse  in  terms  of  geography,  dominant  type  of  pre-K  model 
applied,  and  program  characteristics.  Sampled  programs  included 
a mix of universal and targeted pre-K, including Head Start (Dotterer,
Burchinal, Bryant, Early, & Pianta, 2013). Six states contained
only targeted programs (i.e., those using income requirements
ranging from 110%-350% of the FPL), four had universal
programs,  and  one  state  contained  both.  Across  both  the  Multi- 
State  and  the  SWEEP,  63%  of  children  were  enrolled  in  targeted 
programs  and  37%  in  universal  programs.  Universal  and  targeted 
programs  varied  widely  in  terms  of  classroom-economic  compo¬ 
sition.  Classroom  aggregate  income  in  universal  programs  aver¬ 
aged  approximately  $44,000,  with  a  low  of  $8,000,  a  high  of 
$85,000,  and  a  standard  deviation  of  about  $20,000.  On  average, 
the  percentage  of  pre-K  students  that  were  low-income  in  universal 
classrooms  was  just  under  half  (48.4%),  but  ranged  from  0%  to 
100%.  Thus,  the  universal  programs  in  this  sample  were  econom¬ 
ically  heterogeneous.  As  expected,  targeted  programs  were  more 
disadvantaged  and  less  economically  integrated.  Average  aggre¬ 
gate  income  was  about  $25,000,  though  it  ranged  from  $7,000  to 
$78,000,² and there was less variability in family incomes (e.g.,
standard  deviation  was  less  than  $12,500).  The  classrooms  in 
targeted  programs  were  predominantly  made  up  of  students  from 
low-income  households  (81.4%),  with  a  range  of  10%-100%. 

The  investigators  of  both  studies  randomly  sampled  sites  (i.e., 
pre-K  centers)  within  states,  one  classroom  within  each  site,  and 
four  children  within  each  classroom  (see  Early  et  al.,  2005  for  a 
description  of  the  sampling  process).  Multi-State  investigators 
used  a  multistage,  random  sampling  process  in  20  zip  codes  from 
each  state,  two  sites  from  each  selected  zip  code,  one  pre-K 
classroom  from  each  selected  site,  and  four  pre-K  children  from 
each  selected  classroom.  This  random  sample  was  stratified 
within-state  by  teacher  education,  program  location  (inside  vs. 
outside  of  a  school),  and  program  type  (full-  vs.  part-day  pro¬ 
grams).  Pre-K  data  collection  for  the  Multi-State  took  place  during 
the  2001-2002  school  year.  Of  the  335  sites  contacted,  238  sites 
participated in the fall and two additional sites agreed to participate in the
spring. SWEEP pre-K data collection occurred during the 2003-2004
school year. The investigators recruited a random sample of
100  pre-K  sites  per  state,  stratified  by  county  or  district.  In  total, 
465  sites  participated  in  the  SWEEP  in  the  fall,  and  463  of  those 
continued  participation  in  the  spring.  In  both  studies,  target  stu¬ 
dents  that  left  their  pre-K  classrooms  between  fall  and  spring 
assessments  or  dropped  out  of  the  study  were  replaced  with  ran¬ 
domly  selected  students  from  the  same  classroom.  Four  target 
children  switched  to  another  pre-K  classroom  within  the  same  site. 
These  children  remained  in  the  study,  which  added  another  four 
classrooms  to  the  sample,  and  another  four  children  in  the  original 
classrooms were added.³ There were 142 students who dropped
out  of  the  study  and  were  replaced  with  new  students  from  the 
same  classrooms.  Both  the  Multi-State  and  the  SWEEP  employed 
the  same  measures  and  training,  which  allowed  us  to  collapse  the 
two datasets to address our research aims.⁴ Thus, across the
Multi-State  and  SWEEP,  we  analyzed  data  collected  from  2,966 
children, who ranged in age from 3.8 to 5.7 years (M = 4.6)
at the start of pre-K, in 709 classrooms in 11 states for the current
study. 

Procedure 

In  the  classrooms  randomly  selected  for  participation,  families 
of  all  children  received  packets  containing  a  consent  form  and  a 
demographic  questionnaire.  Data  from  this  demographic  question¬ 
naire  were  pooled  across  all  children  in  the  classroom  to  generate 
measures  of  classroom-economic  composition.  Four  children  were 
randomly  selected  as  target  children  from  all  children  in  a  class¬ 
room.  Direct  assessments  of  the  target  children’s  academic  skills 
were  performed.  In  addition,  teachers  completed  questionnaires 
about  the  target  children,  and  answered  questions  regarding  their 
own  educational  background  and  classroom  characteristics. 
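
As a rough illustration of how child-level questionnaire data might be pooled into classroom-level measures, the sketch below computes a mean income, a standard deviation of incomes, and a percentage of low-income families for one classroom. It is a minimal sketch under stated assumptions, not the procedure used in the Multi-State or SWEEP studies: the function name, the use of the sample standard deviation, the scaling of the standard deviation, and the `LOW_INCOME_CUTOFF` value are all hypothetical choices made for the example.

```python
from statistics import mean, stdev

# Hypothetical cutoff in dollars, for illustration only; it is not the
# low-income definition used in the study.
LOW_INCOME_CUTOFF = 30_000


def classroom_composition(incomes, low_income_cutoff=LOW_INCOME_CUTOFF):
    """Summarize reported yearly family incomes (in dollars) for one classroom."""
    if len(incomes) < 2:
        raise ValueError("need at least two reported incomes")
    n_low = sum(1 for income in incomes if income < low_income_cutoff)
    return {
        # Mean income expressed in $10,000 increments.
        "mean_income_10k": mean(incomes) / 10_000,
        # Sample SD of incomes, also in $10,000 units (an assumption).
        "sd_income_10k": stdev(incomes) / 10_000,
        # Percent low-income expressed in 10% increments.
        "pct_low_income_10": 100 * n_low / len(incomes) / 10,
    }


# Hypothetical classroom with five returned questionnaires.
example = classroom_composition([12_000, 18_000, 25_000, 45_000, 60_000])
```

In this hypothetical classroom, three of the five families fall below the illustrative cutoff, so the low-income measure equals 6 on the 10% scale (i.e., 60%).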

Direct  assessments  of  target  children’s  academic  skills  were 
conducted  in  the  fall  and  spring  of  pre-K.  Children  who  spoke  a 
language  other  than  English  at  home  were  given  a  portion  of  the 
preLAS, an English language proficiency assessment (Duncan &
De Avila, 1998), to screen for English proficiency. Children who
did not score at least 31 of 40 possible points and spoke Spanish
at home (n = 300) were given assessments administered in Spanish
(either using the Spanish-language equivalents of the English
tests  or  by  translating  the  tasks  into  Spanish).  Children  who  did  not 
pass the preLAS and spoke a language other than Spanish at home
were  not  assessed.  All  children  who  were  administered  direct 
assessments  are  included  in  our  analysis.  We  combined  the  English 
and Spanish measures to retain the maximum sample of children
and controlled for the language of assessment. This is consistent
with similar preschool studies (e.g., Parrish & Howes, 2008;
Wong, Cook, Barnett, & Jung, 2008), and the results of this study
are robust to the exclusion of these children.

1 The six states included in the Multi-State Study were California,
Georgia, Illinois, Kentucky, New York, and Ohio. The five states included
in SWEEP were New Jersey, Massachusetts, Texas, Wisconsin, and Washington.

2 Although it is unusual that a “targeted” pre-K classroom would have an
average income as high as $78,000, researchers have documented that there
are often circumstances in which pre-K programs classified as targeted
enroll students that are not low-income (Early et al., 2005).

3 The classrooms that these students came from and entered into were
not significantly different on the classroom measures collected in the fall
and spring (p = .1-.9).

4 Model noninvariance was tested by interacting an indicator for data set
with our key variables of interest. There were no significant interactions,
which further justified pooling the data.

Measures 

Literacy  and  language  skills.  Children’s  early  literacy  and 
language  skills  were  assessed  using  three  measures.  The  Multi- 
State  and  SWEEP  (Early  et  al.,  2005)  assessed  letter-identification 
skills  by  showing  children  a  set  of  mixed  capital  and  lowercase 
letters  and  asking  them  to  identify  as  many  letters  as  they  could 
(Bryant,  Barbarin,  &  Aytch,  2001).  Scores  on  this  assessment 
range from 0-26 (α = .97). The NCEDL Early Writing Task
(NEWT)  was  used  to  assess  children’s  emergent  literacy  skills 
(NCEDL,  2005).  Children  were  asked  to  write  their  names,  as 
name  writing  has  been  found  to  be  an  important  indicator  of  early 
literacy  and  language  knowledge  (Bloodgood,  1999;  Whitehurst  & 
Lonigan,  1998).  The  proportion  of  the  name  the  child  could  write 
legibly, ranging from 0%-100%, was coded. Several research members
coded 50 writing samples to establish reliability. The κ value
had a mean range of .76-.89, and coders scored exactly the same
or within one percentage point 87%-97% of the time. After establishing
reliability, NEWT scores were coded by one researcher.
Language  skills  for  the  English-speaking  children  were  assessed 
using  Dunn  and  Dunn’s  (1997)  Peabody  Picture  Vocabulary  Test 
(3rd ed.; PPVT-III; fall: α = .96, spring: α = .96) or the Test de
Vocabulario en Imágenes Peabody (TVIP; fall: α = .92, spring:
α = .93; Dunn, Padilla, Lugo, & Dunn, 1986) for the Spanish-speaking
children. The raw PPVT-III and TVIP scores were used
because  the  standard  scores  are  standardized  on  different  popula¬ 
tions  (English-speaking  U.S.  children  vs.  children  in  Puerto  Rico 
and  Mexico,  respectively).  These  measures  correlated  at  around 
.40.  To  create  a  combined  measure  of  literacy  and  language,  scores 
on  these  three  assessments  were  standardized  (within  time)  and 
averaged.⁵ We used children’s literacy and language scores from
the  spring  assessment  as  the  outcome.  To  reduce  omitted-variable 
bias,  fall  scores  were  included  as  an  independent  variable  to 
control  for  unmeasured,  time-invariant  differences  in  children  and 
families  that  affect  children’s  achievement  and  may  be  correlated 
with  the  key  independent  variables  of  interest  (Chase-Lansdale  et 
al.,  2003). 
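
The role of the fall score can be made explicit with a generic lagged-outcome specification. The equation below is a sketch in our own notation, offered only to illustrate the logic described above; it is not the estimating equation reported by the authors. Here $Y_{ic}$ is the composite score of child $i$ in classroom $c$, $C_c$ is a classroom-economic composition measure, and $\mathbf{X}_{ic}$ is a vector of other child, family, and classroom covariates.

```latex
Y^{\text{spring}}_{ic}
  = \beta_0
  + \beta_1 \, Y^{\text{fall}}_{ic}
  + \beta_2 \, C_c
  + \boldsymbol{\gamma}^{\top} \mathbf{X}_{ic}
  + \varepsilon_{ic}
```

Because $Y^{\text{fall}}_{ic}$ absorbs stable differences among children that were already reflected in their skills at pre-K entry, $\beta_2$ is interpreted as the association between composition and spring achievement among children with comparable fall scores.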

Math skills. Investigators administered the Applied Problems
subtest of the Woodcock-Johnson III Tests of Achievement (Woodcock,
McGrew, & Mather, 2001; fall: α = .84, spring: α = .83) to
measure children’s math reasoning and problem-solving skills. It
requires the child to analyze and solve math problems by performing
relatively simple calculations. Children not passing the English
language proficiency screener were given the Batería
Woodcock-Muñoz-Revisada: Pruebas de Aprovechamiento, Problemas
Aplicados (Woodcock & Sandoval, 1996; fall: α = .81, spring: α =
.79). Raw scores on the Applied Problems assessment were used.
In addition, children were shown teddy bears and asked to count
them with one-to-one correspondence to measure emergent numeracy
skills (Gelman & Gallistel, 1986). The highest number
counted in the sequence was recorded, with a maximum score of
40. Scores on these two assessments were standardized (within
time) and averaged to create a single measure of early math
achievement. Correlations between the two measures averaged .45.


Classroom-economic composition. Measures of classroom-economic
composition were derived from the demographic questionnaire
that was administered to all children in target children’s
classrooms. Three measures of classroom-economic composition
were considered. The first was a measure of the mean classroom
family yearly income (scaled in $10,000 increments). All income
measures obtained from Multi-State participants, which
were collected in 2001, were inflated to 2003 dollars to be
consistent with the income measures from SWEEP, which were
collected in 2003 (Early et al., 2005). The second measure reflects
the percent of students in the classroom from low-income families
(scaled in 10% increments), with low-income defined as having
family income less than 200% of the FPL. We chose 200% as our
low-income cutoff because states vary in the income thresholds used
to determine eligibility for pre-K subsidies, with cutoffs ranging
from 100% to over 300% of the FPL. The majority of states use
185%-200% as the requirement (Barnett et al., 2014). Moreover,
recent political oratory, including President Obama’s plan for early
education, has argued for the use of 200% as the income guideline
for subsidized pre-K program enrollment (e.g., Office of the Press
Secretary, 2013). Importantly, in addition to the
200% level, we tested 100%, 150%, and 185% of the FPL in our
analyses, and results were consistent across all specifications. Last,
we tested variability in classroom income with a measure of the
standard deviation of classroom incomes.
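
The three composition measures can be sketched for a single classroom as below. This is an illustrative sketch only: the function name, the incomes, and the single poverty line applied to every family are hypothetical (a real poverty line depends on household size), and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def classroom_composition(incomes, poverty_lines, cutoff=2.0):
    """Summarize one classroom's family incomes.

    incomes: family yearly incomes in dollars
    poverty_lines: each family's poverty line in dollars
    cutoff: multiple of the poverty line that defines low income (2.0 = 200%)
    """
    low = [inc < cutoff * line for inc, line in zip(incomes, poverty_lines)]
    return {
        "mean_income_10k": mean(incomes) / 10_000,        # $10,000 units
        "sd_income": stdev(incomes),                      # dollars, sample SD (assumed)
        "pct_low_income_10": 100 * sum(low) / len(low) / 10,  # 10% units
    }

# Hypothetical classroom of four responding families.
incomes = [15_000, 28_000, 40_000, 90_000]
poverty_lines = [18_400] * 4  # illustrative value applied to every family
result = classroom_composition(incomes, poverty_lines)
```

Changing `cutoff` to 1.0, 1.5, or 1.85 gives the alternative low-income definitions mentioned in the text.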

Child and family characteristics. Several child and family
covariates were included in the models.6 These were derived from
interviews with the parents of the target children. We controlled
for the target child’s age, gender, time between assessments,
whether the child was assessed in English or Spanish, and
race/ethnicity, which was represented with dummy variables indicating
whether the child was White (reference group), African American,
Latino, or another race, which included Asian, Native American,
and multiracial. Family structure was represented with an indicator
of whether the target child lived with a single parent and a
continuous measure of the number of children under 18 years of
age living in the household. The highest level of maternal education
was coded in three categories: (a) less than a high school
degree (reference group), (b) a high school diploma or GED, and
(c) a bachelor’s degree or higher. Family income was measured
continuously in 2003 dollars and is expressed in $10,000 increments.
Last, 10 state indicators were included in the models to
represent the state in which the target child resided to control for
state effects.

Classroom characteristics. Several classroom characteristics
were included in the models to reduce the likelihood that observed
associations between classroom-economic composition and individual
achievement were driven by other characteristics of pre-K
classrooms found to be correlated with children’s achievement and
classroom SES. First, using data from the demographic questionnaire
administered to all classroom students, a measure of aggregate
classroom maternal education was created by averaging total


5  Initially,  individual  assessments  were  modeled  as  separate  outcomes. 
However,  results  were  similar  across  assessments  so  we  combined  them 
into  a  single  literacy  and  language  measure.  The  same  is  true  for  math 
assessments. 

6  We  tested  for  interactions  between  all  covariates  and  our  economic 
composition  variables.  None  were  significant. 
years of maternal schooling across all children.7 This was a central
control variable in our efforts to identify the unique association
between classroom-economic composition and achievement, as
opposed to other aspects of classroom SES. Also using data from
the demographic questionnaire, we accounted for the racial/ethnic
composition of the classroom by including a measure of the
percentage of students in the classroom who were African American,
Latino, Asian, Native American, or multiracial. We also
included dichotomous indicators for whether the program was
full-day or part-day and whether the head teacher possessed at
least a bachelor’s degree. We also controlled for class size and
teachers’ years of teaching experience. We chose to include these
structural aspects of classrooms because they have been identified
as potential quality indicators that may relate to achievement, but
also vary by students’ SES (LoCasale-Crouch et al., 2007;
Mashburn et al., 2008). The failure to control for these indicators could
upwardly bias our estimates of peer effects. Other characteristics
of classroom processes and interactions between teachers and
students, such as instructional time and teachers’ emotional
supportiveness, were not included as controls because they are
potential pathways through which classroom-economic composition
may shape the development of academic skills (e.g., Creemers &
Reezigt, 1996), and including these as controls could downwardly
bias estimates of peer effects.

Data  Analysis 

To address our first research question, which considered
whether classroom-economic composition is related to student
achievement, we estimated nonparametric equations using generalized
additive modeling (GAM; Hastie & Tibshirani, 1990) in SAS 9.4.
GAM can be used to identify thresholds at which associations
change by estimating the relation between a predictor and outcome
without making assumptions about whether the nature of that
relation is linear, quadratic, logarithmic, etc. Instead, functional
form is determined empirically by the data. Specifically, GAM
allowed the data to model the nonlinear relations between each of
our three measures of classroom-economic composition and
achievement after controlling for all covariates. GAM output provides
a plot that gives accurate and reliable visual guidance as to
the functional form that best characterizes associations and regions
where thresholds exist (Setodji et al., 2012). The GAMs did not
account for the multilevel nature of the data. However, because
clustering biases standard errors, not regression coefficients
(Cohen, Cohen, West, & Aiken, 2003), and GAM was not being
used to test the significance of parameter estimates, its use here
was appropriate.

GAM  requires  researchers  to  make  judgments  about  the  location 
of  thresholds  and  does  not  provide  significance  tests  of  parameter 
estimates.  Thus,  the  validity  of  identified  thresholds  must  be  tested 
with  other  statistical  methods.  Accordingly,  thresholds  were  tested 
using  spline  regressions,  with  each  potential  threshold  constituting 
a  spline  knot.  These  parameterized  models  (one  for  mean  income, 
one  for  standard  deviation,  and  one  for  low-income  percentage) 
allowed  us  to  examine  whether  the  magnitude  of  classroom- 
economic  composition’s  associations  with  academic  skills  before 
and  after  the  visually  chosen  thresholds  significantly  differed  from 
each  other.  These  models  also  controlled  for  all  covariates. 
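
A linear spline with knots at candidate thresholds can be set up by splitting the predictor into one term per interval, so that each regression coefficient is the slope within its interval. The sketch below shows only that construction; the function name is hypothetical, the knots are the mean-income thresholds reported in the Results ($22,500 and $62,500, in $10,000 units), and the text does not say which routine was actually used to build the terms.

```python
def linear_spline_terms(x, knots):
    """Split x into piecewise-linear terms, one per interval between knots.

    The terms always sum to x, so a coefficient on each term is the
    slope of the outcome with respect to x inside that interval.
    """
    terms = []
    lower = None
    for knot in knots:
        if lower is None:
            terms.append(min(x, knot))                     # below first knot
        else:
            terms.append(max(min(x, knot), lower) - lower)  # between knots
        lower = knot
    terms.append(max(x, lower) - lower)                    # above last knot
    return terms

KNOTS = [2.25, 6.25]  # $22,500 and $62,500 in $10,000 units
low_class = linear_spline_terms(1.0, KNOTS)
mid_class = linear_spline_terms(4.0, KNOTS)
high_class = linear_spline_terms(8.0, KNOTS)
```

Testing whether adjacent coefficients differ is then a test of whether the slope changes at the knot, which is how a visually chosen threshold can be checked.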


Next, the statistical significance and size of relations between
classroom-economic composition and achievement were estimated
with hierarchical linear modeling. First, we estimated
spring academic skills as a function of each of our economic
composition measures separately, controlling for the child’s fall
scores, individual-level child and family characteristics, and
other classroom characteristics. Children’s fall scores were
grand-mean-centered to reduce multicollinearity in the moderation
models (Cohen et al., 2003). We included a classroom-level
random effect to take into account the nesting of children
within classrooms (Rabe-Hesketh & Skrondal, 2008).8

Our final research aim was to examine whether children’s
academic skills at baseline (the start of pre-K) moderated relations
between classroom-economic composition and achievement.
To answer this question, we added interactions between
children’s fall academic scores (grand-mean-centered) and our
composition measures. The interactions were generated using
the appropriate functional form identified in our initial GAM
analyses.
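
The centering, interaction, and simple-slope logic of the moderation models can be sketched as follows. All names, scores, and coefficients here are hypothetical and chosen only to illustrate the arithmetic; they are not estimates from the study.

```python
from statistics import mean

def center(values):
    """Grand-mean-center a list of scores."""
    grand_mean = mean(values)
    return [v - grand_mean for v in values]

def interaction_terms(centered_fall, composition):
    """Product term entered alongside both main effects."""
    return [f * c for f, c in zip(centered_fall, composition)]

def simple_slope(b_composition, b_interaction, centered_fall_value):
    """Slope of composition for a child at a given centered fall score."""
    return b_composition + b_interaction * centered_fall_value

fall = [-1.2, -0.3, 0.4, 1.1]        # hypothetical fall composite scores
mean_income = [2.0, 3.5, 4.0, 6.0]   # hypothetical classroom means, $10,000 units
fall_c = center(fall)
products = interaction_terms(fall_c, mean_income)

# Hypothetical coefficients: main effect 0.01, interaction -0.02.
slope_low_skill = simple_slope(0.01, -0.02, -1.0)
slope_high_skill = simple_slope(0.01, -0.02, 1.0)
```

With a negative interaction coefficient, the composition slope is larger for children who start below the mean and smaller for those who start above it.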

All parameterized models, including those testing the validity
of observed GAM thresholds, were estimated using “xtmixed”
in Stata 13. Mixed-effects models take into account the nesting
of children within pre-K classrooms. We constrained regression
slopes to be equal across classrooms, but allowed intercepts to
vary. By taking account of the clustering of children within
classrooms, mixed-effects modeling provides more accurate
parameter estimates and standard errors, thereby reducing the
likelihood of Type I errors (Bryk & Raudenbush, 1992). In
addition to estimating models with classroom-economic composition
variables separately, we also ran models with mean
income and standard deviation of class incomes in the same
model and standard deviation and low-income percentage in the
same model to determine whether results changed when controlling
for the variability in income. Results were robust to the
inclusion of standard deviations. We could not estimate models
with all three due to multicollinearity between mean income
and percentage of students from low-income households. All
model assumptions were tested using standard techniques
(Cohen et al., 2003), and no violations were observed.

Missing  Data 

There were missing data in the combined data set, though all
cases had valid data on some variables. The amount of missing
data varied depending on whether the data came from the
assessment of the target child, the interview with the target
child’s parents, the interview with the teachers, or the demographic
questionnaire given to all families of children in the
target children’s classrooms. Missing data from the target children
ranged from 7% to 9%, and variables created using data from
their parents were missing in 0%-16% of cases. Very little
data were missing from teachers: only 2%-6%.


7 We could not create categorical measures of classroom-aggregate
maternal education level because the individual parent responses from
classroom peers were not available in the data sets, as explained in detail
in the Missing Data section.

8 We were unable to account for the nesting of children within zip code
because zip code data were not available.
Missing data were not missing completely at random
(MCAR)  according  to  Little’s  test  for  MCAR.  Having  missing 
values  was  related  to  several  other  variables  in  our  analyses, 
including  achievement,  classroom-  and  individual-level  SES, 
and  teacher  characteristics.  To  retain  participants  without  full 
data  in  our  analyses,  missing  data  were  imputed  in  Stata  13 
using  the  multiple  imputation  by  chained  equations  technique  to 
create  20  imputed  datasets  (Royston,  2004).  All  variables  in  our 
analyses,  including  those  related  to  having  missing  values,  were 
used  in  our  imputation  equations.  Imputed  data  were  used  for 
all  analyses  reported  in  our  results  except  GAM  analyses,  in 
which  case  listwise  deletion  was  performed. 
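
The text does not describe how estimates from the 20 imputed datasets were combined. A standard approach is Rubin's rules, sketched below under that assumption; the function name and the three illustrative estimates are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine one coefficient across imputed datasets with Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in std_errors)   # average sampling variance
    between = variance(estimates)                 # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical coefficient and SE from three imputed datasets.
pooled_b, pooled_se = pool_rubin([0.06, 0.07, 0.08], [0.02, 0.02, 0.02])
```

The pooled standard error exceeds the per-dataset standard error because it adds the uncertainty that comes from the imputation itself.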

Every classroom in our sample had valid data on classroom-economic
composition and other compositional characteristics
drawn from the demographic questionnaire; none of the compositional
variables had missing data. This is because the
study investigators created these classroom compositional variables
based on all available children in each classroom, no
matter how many families responded. We were unable to impute
data on the classroom compositional variables because the
raw data from the demographic questionnaires were not provided
in the Multi-State and SWEEP datasets (Early et al.,
2005). Response rates for the demographic questionnaires
completed by the parents of all children in target
classrooms averaged 70% but ranged from 10% to 100%. It should
be noted that the average classroom response rate was high
compared with similar studies (e.g., Henry & Rickman, 2007;
Justice et al., 2011; Mashburn et al., 2009).

Results 

Descriptive  statistics  for  the  sample  can  be  found  in  Table  1  and 
highlight  the  socioeconomic  diversity  of  the  sample,  both  at  the 
child  and  classroom  level. 

Nonlinearities  in  Associations  Between 
Classroom-Economic  Composition  and  Achievement 

GAM diagnostics suggested that the functional form of relations
between achievement and the standard deviation of classroom
income was linear. There were nonlinearities in associations
between academic achievement and the other two economic composition
measures (i.e., mean income and low-income percent).
With respect to literacy and language skills, increases in the
percentage of students from low-income households had a weak
association with skills until 25% of the class was low income, at
which point a strong negative relation emerged. The association
plateaued at 45% low-income. The relation between literacy and


Table 1

Selected Descriptive Statistics for Sample (N = 2,966)

Variable                                      M or %        SD
Academic outcomes
  Literacy and language skills
    Spring                                    -.01          .77
    Fall                                      -.02          .81
  Math skills
    Spring                                    -.02          .87
    Fall                                      -.03          .88
Child-level characteristics
  Age in years at spring assessment           5.05          .32
  Child is male                               49.19%        .50
  Assessed in spring in Spanish               11.63%        .32
  Child race
    White                                     40.99%        .49
    African American                          18.48%        .39
    Latino                                    26.67%        .44
    Other/multiracial                         13.86%        .35
  Family yearly income                        $32,939.58    $25,529.09
  Child lives in single-parent home           40.44%        .49
  Maternal education
    Less than high school                     18.80%        .39
    High school                               63.58%        .48
    Bachelor’s or greater                     17.62%        .38
  Number of siblings                          2.43          2.55
Classroom-level characteristics
  Mean class income                           $32,285.19    $18,024.89
  % of class that is low-income (<200% FPL)   69.18%        .31
  Mean years of maternal education            12.77         1.36
  Full-day prekindergarten                    43.21%        .49
  Teacher has bachelor’s degree               70.75%        .45
  Teacher’s years of experience               13.22         9.23
  Class size                                  18.52         5.56
  % of class racial/ethnic minority           59.31%        .37

Note. Descriptive statistics were calculated using imputed data. Academic outcomes are in standard score units.
Classroom-level characteristics were calculated at the individual level for 2,966 children in 709 classrooms.
language skills and mean income was linear. On the other hand,
GAM diagnostics revealed nonlinearities in links between math
achievement and both mean classroom income and percent low
income. Specifically, mean income was not related to math scores
until mean levels reached about $22,500, at which point it had
positive links until approximately $62,500, when the association
plateaued. When looking at links between low-income percent and
math, children’s math scores remained relatively flat as the
percentage of students from low-income households in classrooms
increased, until the classroom reached just over 50% low income,
at which point the relation grew much steeper in the negative
direction. It again weakened after more than 70% of the class was
low income. These nonlinearities were best fit by spline functions.
The results of the parameterized models showed that observed
thresholds were valid, meaning the size of relations between
achievement and mean income and percent low income differed
significantly before and after the thresholds (25% and 45% for
low-income percent and literacy and language, $22,500 and
$62,500 for mean income and math, and 52.5% and 72.5% for
percent low income and math).

Associations  Between  Classroom-Economic 
Composition  and  Individual  Achievement 

Associations between the standard deviation of classroom income
and individual achievement were never significant, so for
parsimony’s sake these were dropped from the model and are not
discussed further. Table 2 presents unstandardized regression
coefficients from models assessing associations between academic
skills and average family income in the classroom, controlling for
fall scores as well as a host of other child, family, and classroom
characteristics. In the text below, we report standardized effect
sizes to provide more meaningful interpretation of results. As seen
in Table 2, when predicting literacy and language skills, mean
classroom income is represented with a linear measure. Because
GAM analyses revealed nonlinear links between mean income and

Table 2

Mixed-Effects Regression Predicting Spring Academic Skills With Mean Classroom Income

                                        Literacy and language   Math
Variable                                Coefficient   (SE)      Coefficient   (SE)
Linear
  Mean class income                     .004          (.01)
Spline
  Mean class income <$22,500                                    -.05          (.04)
  Mean class income = $22,500-$62,500                           .07***        (.02)
  Mean class income >$62,500                                    -.04          (.04)
Child characteristics
  Fall skills                           .71***        (.01)     .67***        (.01)
  Age                                   .10**         (.03)     .24***        (.04)
  Days between assessments              .002***       (.00)     .002***       (.00)
  Child is male                         -.03          (.02)     -.06**        (.02)
  Race
    African American                    -.03          (.03)     -.04          (.04)
    Latino                              .01           (.04)     .02           (.04)
    Other race                          -.01          (.03)     .02           (.04)
  Assessed in Spanish                   [illegible]   (.03)     -.07          (.05)
  Family income                         .01           (.01)     .01           (.01)
  Single-parent home                    -.03          (.02)     -.04          (.03)
  Maternal education
    High school                         .04           (.02)     .01           (.03)
    Bachelor’s degree                   .05           (.04)     .04           (.05)
  Number of siblings                    -.02**        (.01)     .01           (.01)
Classroom characteristics
  Full-day prekindergarten              .02           (.03)     .02           (.03)
  Teacher has bachelor’s degree         .06*          (.03)     .05           (.03)
  Teacher experience                    -.001         (.001)    -.001         (.001)
  Class size                            .002          (.002)    .003          (.002)
  % Minority                            .001          (.05)     .01           (.06)
  Mean classroom maternal education     .004          (.01)     .01           (.02)
Intercept                               -.94***       (.21)     -1.70***      (.28)
Random-effects parameters
  Within-classroom variance             .15***        (.00004)  .29***        (.0001)
  Between-classroom variance            .02***        (.0002)   .01           (.001)

Note. N = 2,966. There were 1,174 children in 276 classrooms with mean class incomes of <$22,500, 1,505
children in 362 classrooms with mean class incomes of $22,500-$62,500, and 287 children in 71 classrooms
with mean class incomes of >$62,500. Coefficients are unstandardized. Dummy variables indicating children’s
state of residence were also included in models, but results are not shown here. Categories for child’s race are
compared with the omitted White group. Maternal education categories are compared with the omitted “below
high school” group.

* p < .05. ** p < .01. *** p < .001.
math scores, mean income is represented with a spline specification
with two thresholds: one at $22,500 and one at $62,500. Thus,
there are three mean-income terms. The first term represents the
slope of the relation between mean class income and math achievement
below $22,500. The second term is the slope of the relation
between mean income and math between $22,500 and $62,500,
and the last term is the slope of the relation after the $62,500
threshold is reached. Table 3 shows results of parameterized models
looking at the percent of children in the classroom who come
from low-income households as our indicator of classroom-economic
composition. Spring achievement was predicted using
spline terms with thresholds at 25% low income and 45% low
income for literacy and language and at 52.5% and 72.5% for
math. Similar to mean income, the three coefficients represent,
respectively, the slope before the first threshold, between the two
thresholds, and after the second threshold.
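
To show how the three segment slopes combine, the sketch below computes the difference in predicted math score between two classrooms that differ only in mean income, using the unstandardized Table 2 math coefficients. The function name and the example incomes are hypothetical, and the first and third slopes were not statistically significant, so their contribution should be read with caution.

```python
KNOTS = (2.25, 6.25)            # $22,500 and $62,500 in $10,000 units
SLOPES = (-0.05, 0.07, -0.04)   # Table 2 math coefficients per $10,000

def predicted_difference(income_a, income_b, knots=KNOTS, slopes=SLOPES):
    """Predicted math difference (b minus a) with all covariates equal.

    Assumes income_a <= income_b, both in $10,000 units.
    """
    edges = (float("-inf"),) + tuple(knots) + (float("inf"),)
    total = 0.0
    for slope, lo, hi in zip(slopes, edges[:-1], edges[1:]):
        # Portion of the move from income_a to income_b inside this segment.
        overlap = max(0.0, min(income_b, hi) - max(income_a, lo))
        total += slope * overlap
    return total

within_middle = predicted_difference(3.0, 5.0)   # $30,000 vs. $50,000
across_all = predicted_difference(2.0, 7.0)      # $20,000 vs. $70,000
```

A move that stays inside the middle segment is simply the middle slope times the income difference; a move that crosses a threshold sums the contribution from each segment it passes through.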


Results indicate that average classroom income was unrelated to
children’s literacy and language skills (see Table 2). However, there
was a negative relation between the percentage of preschoolers in a
classroom who came from low-income households and literacy and
language skills (see Table 3), but only within the 25%-45%
low-income range. More specifically, once a quarter of the preschoolers in
the class came from low-income households, each 10-percentage-point
increase in the share of students from low-income households was
related to a .06 SD decrease in literacy
and language achievement. Beyond the low-income threshold of 45%
of the class, further increases were unrelated to literacy and language
achievement. We do not discuss relations between the covariates
and achievement for parsimony’s sake. It is important to note, however, that
although there were some significant associations between the covariates
and achievement (most notably the strong association between
fall and spring scores), many child-level characteristics, including
children’s own household income, and all of the other classroom-level


Table 3

Mixed-Effects Regression Predicting Spring Academic Skills With Low Income Percent of Class

                                        Literacy and language     Math
Variable                                Coefficient     (SE)      Coefficient     (SE)
Spline
  <25% class low income                 .03             (.02)
  25-45% class low income               -.05*           (.02)
  >45% class low income                 .01             (.01)
  <52.50% class low income                                        -.01            (.01)
  52.50-72.50% class low income                                   -.08**          (.02)
  >72.50% class low income                                        .03             (.02)
Child characteristics
  Fall skills                           [illegible]     (.01)     [illegible]***  (.01)
  Age                                   .10**           (.02)     [illegible]***  (.04)
  Days between assessments              .002***         (.00)     .002***         (.00)
  Child is male                         -.03            (.02)     -.06**          (.02)
  Race
    African American                    -.03            (.03)     -.04            (.04)
    Latino                              .01             (.04)     .01             (.04)
    Other race                          -.004           (.03)     .02             (.04)
  Assessed in Spanish                   [illegible]     (.03)     -.07            (.04)
  Family income                         .01             (.01)     .01             (.01)
  Single-parent home                    -.03            (.02)     -.03            (.03)
  Maternal education
    High school                         .04             (.02)     .01             (.03)
    Bachelor’s degree                   .04             (.04)     .03             (.04)
  Number of siblings                    -.02**          (.01)     .01             (.01)
Classroom characteristics
  Full-day prekindergarten              .01             (.03)     .02             (.03)
  Teacher has bachelor’s degree         .06*            (.03)     .05             (.03)
  Teacher experience                    -.001           (.001)    -.0003          (.001)
  Class size                            .002            (.002)    .003            (.002)
  % Minority                            -.001           (.05)     -.001           (.06)
  Mean classroom maternal education     .01             (.01)     .01             (.01)
Intercept                               [illegible]***  (.24)     -1.73**         (.30)
Random-effects parameters
  Within-classroom variance             .15***          (.00004)  [illegible]***  (.0001)
  Between-classroom variance            .02***          (.0002)   [illegible]***  (.001)

Note. N = 2,966. There were 388 children in 95 classrooms with <25% low-income students, 325 children
in 80 classrooms with 25%-45% low-income students, and 2,253 children in 534 classrooms with
>45% low-income students. There were 856 children in 209 classrooms with <52.5% low-income
students, 397 children in 94 classrooms with 52.5%-72.5% low-income students, and 1,713 children in 406
classrooms with >72.5% low-income students. Coefficients are unstandardized. Dummy
variables indicating children’s state of residence were also included in models, but results are not shown here.
Categories for child’s race are compared with the omitted White group. Maternal education categories are
compared with the omitted “below high school” group.

* p < .05. ** p < .01. *** p < .001.
characteristics, including mean maternal education, did not predict
literacy  and  language  scores  above  and  beyond  other  predictors  in  the 
model.  This  was  true  for  models  predicting  math  skills  as  well.  This 
is  likely  because  our  models  controlled  for  achievement  skills  in  the 
fall  of  pre-K. 

Moving on to math, increases in mean classroom income positively
predicted math skills, but only when the increases occurred between
$22,500 and $62,500 (see Table 2). In that range, $10,000 increases of
aggregate classroom income were related to increases of .08 SD in
math skills. Under $22,500 and over $62,500, income increases did
not relate to improvements in math skills. There were also associations
between low-income percent and math skills (see Table 3). As
results show, increases in the percentage of children from low-income
households had negative links with math achievement when preschool
classrooms had between 52.5% and 72.5% of children from
low-income households. Put differently, increases in the percentage of
children from low-income households did not predict declines in math
skills until about half of the classroom came from low-income homes.
At that point, further increases in the proportion of students from
low-income households related to lower math skills (.09 SD decrease
in math achievement for every 10-percentage-point increase in students from
low-income households). This association plateaued when roughly 72.5%
of the students had low income, at which point increasing proportions
of students from low-income households were not related to math
skills declines.
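
The text reports standardized effect sizes alongside unstandardized coefficients without stating the conversion. One conversion that reproduces the reported values, offered here only as an assumption, divides each unstandardized coefficient by the spring standard deviation of the outcome in Table 1. The function and variable names are hypothetical.

```python
SPRING_SD = {"literacy_language": 0.77, "math": 0.87}  # from Table 1

def standardized_effect(coefficient, outcome):
    """Express an unstandardized coefficient in outcome SD units (assumed conversion)."""
    return coefficient / SPRING_SD[outcome]

# Unstandardized coefficients taken from Tables 2 and 3.
math_income = standardized_effect(0.07, "math")                        # $22,500-$62,500 segment
math_low_income = standardized_effect(-0.08, "math")                   # 52.5%-72.5% segment
literacy_low_income = standardized_effect(-0.05, "literacy_language")  # 25%-45% segment
```

Rounded to two decimals, these give .08, -.09, and -.06, which agree in size with the effects described in the text.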

Moderation  by  Children’s  Initial  Academic  Ability 

Tables 4 and 5 present the results of models examining whether
associations between classroom-economic composition and academic
skills were moderated by children’s academic skills at the
start of pre-K. All interaction models controlled for all covariates
(listed in the notes of the tables), but for parsimony’s sake only the
main effects and interaction terms are presented in the tables.
Interactions were tested using the GAM-identified functional
forms.

Results revealed few significant interactions between individual
skills and classroom-economic composition, with one notable exception.
For early literacy and language skills, relations between
mean classroom income and academic skills were weaker for
children who began pre-K with better skills (see Table 4). Figure
1 illustrates the associations between mean classroom income and
literacy and language skills for children with high (scoring 1 SD
above the mean), average (the mean), and low (1 SD below the
mean) fall skills. As initial literacy and language skills increased,
the relation between mean class income and skills diminished. For
instance, while the main-effect model (see Table 2) showed no link
between aggregate income and literacy and language skills, moderation
results revealed that increases in aggregate classroom income
predicted significant growth in literacy and language skills
for students whose fall skills were 0.7 SD or more below the mean.
There were no significant relations between aggregate classroom
income and literacy and language skills for children with average
or high fall skills. There was also evidence of initial skill moderation
of the link between percent low income and literacy and
language skills, though only at low-income percentages above 45%
(see Table 5). However, the practical significance of this interaction
is questionable because simple slope tests revealed that associations
between percent low income and literacy and language
were not significant, even for children falling as much as 1 SD
above or below the mean.


Table 4

Moderation of Association Between Mean Classroom Income and Spring Academic Skill Level
by Children’s Fall Skill Level

                                                  Literacy and language     Math
Variable                                          Coefficient     (SE)      Coefficient     (SE)
Linear
  Main effects
    Mean class income                             .01             (.01)
    Fall skills                                   .80***          (.03)
  Interactive effect
    Mean class income × skills                    [illegible]     (.01)
Spline
  Main effects
    Mean class income <$22,500                                              -.05            (.04)
    Mean class income = $22,500-$62,500                                     [illegible]     (.02)
    Mean class income >$62,500                                              -.05            (.05)
    Fall skills                                                             .58***          (.08)
  Interactive effects
    Mean class income <$22,500 × skills                                     .05             (.04)
    Mean class income = $22,500-$62,500 × skills                            -.02            (.01)
    Mean class income >$62,500 × skills                                     .03             (.04)
Intercept                                         -1.01           (.21)     -1.70***        (.28)
Random-effects parameters
  Within-classroom variance                       .15***          (.00004)  [illegible]***  (.0001)
  Between-classroom variance                      .02***          (.0001)   .01***          (.001)

Note. N = 2,966. Coefficients are unstandardized. Models included all covariates listed in Table 2 and the state
indicators.

Table 5

Moderation of Association Between % of Class Low Income and Spring Academic Skills by
Children’s Fall Skill Level

                                                  Literacy and language    Math
Variable                                          Coefficient  (SE)        Coefficient  (SE)

Spline
  Main effects
    <25% class low income                         .02          (.03)
    25-45% class low income                       -.07**       (.03)
    >45% class low income                         .01          (.01)
    <52.50% class low income                                               -.02         (.01)
    52.50-72.50% class low income                                          -.07**       (.03)
    >72.50% class low income                                               .03          (.02)
    Fall skills                                                (.05)       .61***       (.04)
  Interactive effects
    <25% class low income X skills                .001         (.03)
    25-45% class low income X skills              .03          (.03)
    >45% class low income X skills                .02*         (.01)
    <52.50% class low income X skills                                      .01          (.01)
    52.50-72.50% class low income X skills                                 .03          (.03)
    >72.50% class low income X skills                                      -.03         (.02)
Intercept                                                      (.24)       -1.70***     (.30)
Random-effects parameters
  Within-classroom variance                       [illegible]  (.00004)                 (.0001)
  Between-classroom variance                      .02***       (.0001)     [illegible]  (.001)

Note. N = 2,966. Coefficients are unstandardized. Models included all covariates listed in
Table 2 and the state indicators.

*p < .05. **p < .01. ***p < .001.


[Figure 1 plot not reproduced. Lines: High ability (+1 SD), Average ability, Low ability
(-1 SD). X-axis: Mean Classroom Income ($10,000 increments), 0 to 8.]

Figure 1. Interaction between mean classroom income and fall language and literacy skills.
This figure presents slope coefficients (unstandardized) on mean classroom income for
children falling 1 SD above, at, and 1 SD below the mean of fall literacy and language
skills. See the online article for the color version of this figure.


Discussion

Mounting evidence on school success demonstrates that children who begin school with
strong early literacy, language, and numeracy skills are more likely to experience
continued academic success compared with less prepared peers (e.g., Duncan et al.,
2007). Thus, it is vital to understand predictors of early academic skills and to
pinpoint high-risk groups and potential levers for prevention and intervention.
Results from this study suggest that even after controlling for children’s
achievement at the start of the pre-K year, increased mean income and smaller
percentages of students from low-income households in pre-K classrooms have positive
relations with preschool children’s achievement, particularly math achievement.

Our findings, which are consistent with earlier scholarship on children in primary
and secondary school (see van Ewijk & Sleegers, 2010), extend this literature by
focusing on preschool. Results also illustrate that frog-pond theories of peer
effects are not relevant to preschool children. According to frog-pond theories,
increases in classroom income and decreases in the share of disadvantaged students
would be harmful to the achievement of preschoolers from low-income households
because more economically advantaged peers would raise competition for positive
evaluations and cause the disadvantaged students to feel stigmatized and inadequate.
Studies finding increased economic integration to be harmful to the achievement of
children from low-income households have, notably, focused on high school students
(e.g., Crosnoe, 2009). The processes purportedly driving those findings (e.g.,
competition for grades, stigmatization, and negative self-appraisals) seem less
relevant in early childhood because grades are often not assigned, children tend to
be unaware of their relative socioeconomic standing, and they are less likely to make
negative self-evaluations in reference to peers (Suls, 1986).

Second, findings also suggest that links between both mean classroom income and
percentage of students from low-income households and achievement are nonlinear. The
positive association between aggregate classroom income and math scores starts at
just above $20,000 aggregate income and tapers at around $60,000. Similarly,
increases in the percentage of children from low-income households in pre-K
classrooms relate to decreased math scores when the class is between approximately
50% and 70% low income and decreased literacy and language scores when the class is
between 25% and 45% low income. These findings suggest that there is a “range of
action” within which modifications in economic composition will produce changes in
achievement, whereas increases or decreases in economic advantage under or over those
thresholds have no links to achievement. Similarly, changes in the proportion of
children from low-income households have no relation to achievement in pre-K
classrooms with either high concentrations of disadvantaged students or very few
students from low-income households. Results add to the limited literature exploring
threshold effects or nonlinearities when examining peer effects on achievement (e.g.,
Burke & Sass, 2013; Hoxby & Weingarth, 2005).

Third, we found little evidence that children’s academic skills at the start of the
pre-K year moderate relations between classroom-economic composition and improvements
in academic skills over the preschool year. With the exception of mean classroom
income and literacy and language skills, increasing classroom economic advantage
within the “range of action” has similar benefits to achievement for all children,
regardless of their skills at the start of the school year. To the authors’
knowledge, this is one of the few studies to explore the moderating effects of
individual ability on associations between achievement and classroom-economic
composition, as opposed to peers’ aggregate ability.

The results obtained from this study are important additions to the literature on
peer-compositional effects because children’s economic circumstances are a primary
criterion used to determine eligibility for targeted pre-K programs. Hence,
establishing links between classroom-economic composition and children’s development
in preschool is highly relevant to current policy considerations regarding whether
public funds are better invested in targeted or universal pre-K programs. These
results suggest that universal pre-K may narrow economic disparities in early
achievement better than targeted programs, though universal pre-K access does not
guarantee economically integrated classrooms because factors such as residential
segregation and parental preferences for private preschool could lead to economically
homogeneous universal classrooms.

It is important to note that the effect sizes linking both measures of
classroom-economic composition to achievement during the pre-K year are small: 0.08
SD per $10,000 increase in income between $22,500 and $62,500 for math, and 0.06 and
0.09 SD per 10% decrease in students from low-income households for literacy and
language (between 25% and 45%) and math (between 52.5% and 72.5%), respectively.
These effect sizes, however, are not inconsistent with prior studies of classroom SES
and achievement that address omitted-variable bias (e.g., Hutchison, 2003; Rivkin,
2001). To put these results in perspective using the average targeted and universal
classrooms contained in the Multi-State/SWEEP (Early et al., 2005), consider two
otherwise similarly situated children with average achievement, except one is in a
targeted classroom (average family income is $25,000 and 81% of the children are
low-income) and the other in a universal classroom ($44,000 average income and 48% of
children are low-income). We would expect that the child in the universal classroom
would see math-skills growth over the pre-K year almost 0.2 SD greater than his or
her peer in the targeted pre-K. We would not expect to see significant differences in
literacy and language scores, because both hypothetical classes lie above the 25%-45%
range of action for associations between percent low income and literacy and
language.
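
The comparison above can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name `exposure_within_range` is hypothetical, the per-unit effect sizes are the ones quoted in the text, and the article does not state which of the two math calculations underlies its "almost 0.2 SD" figure, so both are shown. The two math estimates come from separate models and should not be summed.

```python
def exposure_within_range(a, b, lower, upper):
    """Return how much of the span between a and b lies inside [lower, upper]."""
    lo, hi = min(a, b), max(a, b)
    return max(0.0, min(hi, upper) - max(lo, lower))


# Effect sizes quoted in the text, in SD units.
MATH_SD_PER_10K_INCOME = 0.08        # applies between $22,500 and $62,500
MATH_SD_PER_10PT_LOW_INCOME = 0.09   # applies between 52.5% and 72.5% low income
LIT_SD_PER_10PT_LOW_INCOME = 0.06    # applies between 25% and 45% low income

# Average targeted and universal classrooms described in the text.
targeted = {"mean_income": 25_000, "pct_low_income": 81.0}
universal = {"mean_income": 44_000, "pct_low_income": 48.0}

# Math gap implied by the mean-income model.
income_span = exposure_within_range(
    targeted["mean_income"], universal["mean_income"], 22_500, 62_500
)
math_gap_via_income = MATH_SD_PER_10K_INCOME * income_span / 10_000

# Math gap implied by the percent-low-income model.
math_pct_span = exposure_within_range(
    targeted["pct_low_income"], universal["pct_low_income"], 52.5, 72.5
)
math_gap_via_pct = MATH_SD_PER_10PT_LOW_INCOME * math_pct_span / 10

# Literacy and language gap: both classes sit above the 25%-45% range.
lit_pct_span = exposure_within_range(
    targeted["pct_low_income"], universal["pct_low_income"], 25.0, 45.0
)
lit_gap_via_pct = LIT_SD_PER_10PT_LOW_INCOME * lit_pct_span / 10

print(f"Math gap via mean income:   {math_gap_via_income:.3f} SD")
print(f"Math gap via % low income:  {math_gap_via_pct:.3f} SD")
print(f"Literacy gap via % low inc: {lit_gap_via_pct:.3f} SD")
```

Only the part of each difference that falls inside the relevant range of action contributes, which is why the 33-point difference in low-income share counts as 20 points for math and as zero for literacy and language.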

Differences  in  Results  for  Literacy  and  Language 
Versus  Math  Outcomes 

There were some differences between the results for language and literacy and math.
First, links between economic composition and skills were stronger for math than they
were for literacy and language. Indeed, there was no association between aggregate
income and literacy and language skills in main-effects models, though moderation
models revealed that mean income was related to literacy and language skills for
lower-achieving students. Next, threshold effects of classroom-economic composition
were different for literacy and language versus math skills; there were none for
links between mean income and literacy and language, and the thresholds for
low-income percent occurred at lower values for literacy than for math. Further,
findings that children’s initial ability levels moderated associations between
classroom-economic composition and achievement applied only to early literacy and
language skills.


These results reinforce developmental theory and research highlighting the
differences in early language, literacy, and math skills development. For example,
math skill acquisition requires active learning environments and direct, intentional
instruction and is not as conducive to learning via modeling as is language/literacy
development (Ginsburg et al., 2008). Preschool curriculum tends to be dominated by
literacy instruction. Indeed, research has shown that preschool teachers are more
disinclined to teach math concepts than literacy/language skills (Blevins-Knabe et
al., 2000; Ginsburg et al., 2008), and this may be especially so in classes with
increased behavioral and learning problems. In addition, reliance on teacher-centered
instruction tends to be greater in classrooms with high levels of disadvantage
(Stipek, 2004), which may negatively impact math learning more than literacy
(Ginsburg et al., 2008). We may have observed stronger links between classroom
composition and math skills because, as classrooms become more highly disadvantaged,
math instruction may be the first content area that preschool teachers eliminate. On
the other hand, we may not have observed relations between classroom-economic
composition and literacy and language because instruction in these areas is such a
core, basic part of preschool curricula that classroom-economic composition may have
smaller effects on the amount and/or quality of instruction that students receive.
Moreover, early literacy and language development may be less negatively impacted by
the teacher-centered instruction more common in classrooms with greater numbers of
disadvantaged children (e.g., Ginsburg et al., 2008; Stipek, 2004), which might
reduce classroom-economic compositional effects on early literacy and language
skills.

Second, home environments may exert greater influence over literacy and language than
math skills (Blevins-Knabe et al., 2000; LeFevre et al., 2009). Parents are more
likely to engage in formal and informal literacy activities with young children
through conversation, shared book reading, singing the alphabet song and other
rhymes, and teaching letters than they are numeracy activities (Blevins-Knabe et al.,
2000; Tudge & Doucet, 2004). There tend to be fewer math learning opportunities in
children’s home environments. Math skill acquisition may be more strongly influenced
by classroom composition because children rely more on formal instruction in these
settings. Thus, diminished quantity/quality of math instruction due to increased
classroom disadvantage would disrupt math learning to a greater extent than literacy
or language development, which is a strong focus in the home environment as well.

Similar processes may be driving the moderation by baseline skills of links between
economic composition and literacy and language skills. Children with high initial
literacy and language skills are likely receiving more frequent, rich, and complex
language and literacy interactions at home (e.g., Bracken & Fischel, 2008; Brandt,
2001; Weigel, Martin, & Bennett, 2006). Children with average and below average
literacy and language skills, on the other hand, may rely more on pre-K to foster
growth in this domain, which could explain why we uncovered stronger links between
pre-K classroom-economic composition and literacy and language skills for children
with less advanced skills. Math, on the other hand, may be primarily learned at
school regardless of individual differences in math skills because children have more
limited opportunities for math learning at home (Tudge & Doucet, 2004). Further,
because preschool teachers are generally less apt to engage in math instruction
(Ginsburg et al., 2008), math skills may be more highly compromised by increased
behavioral or learning problems in more disadvantaged classrooms. In contrast, even
in classrooms characterized by increased academic and behavior problems, teachers may
still provide enough literacy instruction to promote literacy development for those
who already have mastered basic skills, whereas the children who need more intensive
teaching (the lower achievers) are disproportionately harmed by the limited
opportunities for structured learning and peer modeling. Alternatively, there may
have been ceiling effects on the literacy and language assessments for the children
who entered pre-K with high literacy and language skills. Thus, diminished
associations between classroom-economic composition and literacy and language skills
as initial skill level increases may be attributable to limited literacy and language
gains for these students due to their mastery of letter identification, rhyming, and
name writing at the start of pre-K.

Threshold  Effects  of  Classroom-Economic 
Composition  on  Achievement 

The threshold effects of peer-economic composition observed in this study make a
novel contribution to the literature on peer effects because prior studies have
focused primarily on peer ability. The two thresholds, which defined the range of
action, differ from those in prior studies exploring nonlinearities in peer effects.
GAM allowed us to identify these thresholds because it is nonparametric and flexible
and provided important guidance on where the thresholds occurred. Previous studies
have used more rigid approaches, such as imposing a quadratic function (e.g., Zimmer
& Toma, 2000) or using a priori thresholds (e.g., Neidell & Waldfogel, 2010), which
may obscure important thresholds that do not fit the specific nonlinear function or
thresholds that have been identified a priori.
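
Once thresholds have been located, a piecewise-linear (spline) specification lets each income range carry its own slope, as in the rows labeled "Spline" in Tables 4 and 5. A minimal sketch of how such terms could be constructed is below; `linear_spline_terms` is a hypothetical helper, the knots are the $22,500 and $62,500 thresholds reported for math (expressed in $10,000 units), and this is not the authors' code.

```python
def linear_spline_terms(x, knots):
    """Split x into one term per segment defined by ascending knots.

    Each term holds the amount of x that falls inside its segment, so a
    regression on these terms estimates a separate slope per segment and
    the terms always sum back to x.
    """
    terms = []
    previous = None
    for knot in knots:
        if previous is None:
            # First segment: everything up to the first knot.
            terms.append(min(x, knot))
        else:
            # Middle segment: the part of x between two adjacent knots.
            terms.append(max(min(x, knot), previous) - previous)
        previous = knot
    # Last segment: whatever lies above the final knot.
    terms.append(max(x, previous) - previous)
    return terms


INCOME_KNOTS = [2.25, 6.25]  # $22,500 and $62,500 in $10,000 units

for income in (1.0, 4.4, 8.0):
    below, within, above = linear_spline_terms(income, INCOME_KNOTS)
    print(f"income={income}: below={below}, within={within:.2f}, above={above}")
```

With this coding, the coefficient on the middle term is the slope inside the range of action, and the coefficients on the outer terms are the slopes below and above it.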

Next, in cases in which nonlinearities were identified, both a tipping point and a
point of fade-out were observed. Specifically, negative links between low-income
percent and literacy and language and math scores were not observed until at least
25% and 52.5% of the class were students from low-income households, respectively.
The positive relation between average classroom income and math achievement emerged
only at the $22,500 threshold. These tipping-point or critical-mass effects of
classroom-concentrated disadvantage or economic integration have long been theorized
(e.g., Crane, 1991; Johnson, Ladd, & Ludwig, 2002), and this investigation supports
their existence with respect to peer-economic effects on early achievement. In
addition to tipping points, we saw links between classroom-economic composition and
achievement fade out after a certain level of mean income/economic integration was
achieved, thereby providing important information about the levels of economic
integration that may be beneficial for children’s academic skills.

Threshold effects may represent the points after which teachers are no longer able to
adequately address children’s academic or behavioral issues without affecting the
learning in the classroom. For example, teachers can address occasional learning or
behavioral problems without having to sacrifice the amount, quality, or content of
instruction or educationally enriching interactions. However, when the percentage of
students from low-income households reaches the critical mass, further increases in
the percentage of disadvantaged students may alter classroom climate. Fade-out of
links between percentage of students from low-income households and achievement may
occur when classroom problems become so pervasive that further increases in the
prevalence of at-risk peers have little practical impact on teachers and students,
and hence learning, in the class. It is interesting to note that the critical mass at
which negative associations between percent low income and achievement become
apparent differs for literacy and language and math. Neidell and Waldfogel (2010)
found a similar pattern of differences in the location of threshold effects of
classroom composition on reading and math achievement.

In this study, the percent-low-income compositional variable is a stronger predictor
of achievement than is mean classroom income. Low-income percent reflects the
concentration of economically disadvantaged students in the class, which may be
important because poverty places children at especially increased risk for academic
and behavioral difficulties (Magnuson & Votruba-Drzal, 2009). Aggregate classroom
income, on the other hand, reveals nothing about the ratio of disadvantaged to more
advantaged children. For example, two classrooms with mean incomes of $33,000 may
have entirely different climates. One may contain three students living in poverty
with family incomes of $5,000 annually and seven middle-income students in families
earning $45,000. The second may contain eight poor students with family incomes of
$20,000 and two upper-income students whose families make $85,000. This second
classroom has more concentrated disadvantage than the first. We might expect that
adding one disadvantaged child to, or removing one from, the latter classroom may
have little impact on the climate if behavioral and learning problems are already so
intense that the teacher and children are overwhelmed. But in the first classroom,
changes in aggregate income due to adding or subtracting a highly disadvantaged child
may result in significant changes in achievement by appreciably altering the learning
climate. The different nature of these compositional variables should guide future
research on peer effects. In particular, measures capturing the proportion of poor
versus nonpoor students may be the best indicators of economic composition. But if it
is expected that income effects exist more broadly across the income distribution,
using aggregate income may be more appropriate.
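
The two-classroom example can be checked directly. In the sketch below, the income lists follow the text, while `LOW_INCOME_CUTOFF` is a hypothetical threshold chosen only so that the students the text calls poor fall below it; it is not the low-income definition used in the study.

```python
from statistics import mean

# Hypothetical cutoff for illustration only; not the study's definition.
LOW_INCOME_CUTOFF = 25_000

# Classroom compositions as described in the text.
first_class = [5_000] * 3 + [45_000] * 7    # three poor, seven middle-income
second_class = [20_000] * 8 + [85_000] * 2  # eight poor, two upper-income


def summarize(incomes, cutoff=LOW_INCOME_CUTOFF):
    """Return (mean income, share of students below the cutoff)."""
    share_low = sum(income < cutoff for income in incomes) / len(incomes)
    return mean(incomes), share_low


first_mean, first_share = summarize(first_class)
second_mean, second_share = summarize(second_class)

print(f"First class:  mean=${first_mean:,.0f}, low-income share={first_share:.0%}")
print(f"Second class: mean=${second_mean:,.0f}, low-income share={second_share:.0%}")
```

Both classrooms have the same mean income, yet their low-income shares differ sharply, which is the distinction between the two compositional measures that the paragraph draws.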

Situating  Results  in  Broader  Research  on  Pre-K 

This study may also provide insight for understanding variation in pre-K-program
evaluations. For instance, universal pre-K studies, such as Gormley, Phillips, and
Gayer (2008) and Gormley and colleagues’ (2005) studies of Oklahoma’s universal pre-K
program and Weiland and Yoshikawa’s (2013) study of universal pre-K in Boston, have
found moderate to large (0.40-1 SD) benefits (see also Henry et al., 2003). These
effects are substantially larger than those uncovered in nationally representative
studies of center-based preschool, which include private center-based preschools in
addition to universal and targeted programs, and studies of targeted pre-K, including
Head Start (range = 0.10-0.25 SD; e.g., Magnuson & Waldfogel, 2005; Votruba-Drzal et
al., 2013; Zill et al., 2003). Our findings suggest that increased classroom-economic
advantage in universal pre-K programs may play some role in the relatively larger
effect sizes obtained from universal pre-K evaluations. It is important to note,
though, that there are other differences between the universal and targeted programs
that may be important as well, including curriculum, program philosophy, teachers’
qualifications and salary, and evaluation design. Of course, universal pre-K is not
synonymous with economically integrated classrooms; children can be provided
universal pre-K in highly segregated classrooms. However, the data from our study and
the Boston pre-K program (Weiland, 2013) show that universal programs do result in at
least some socioeconomically diverse classrooms.

Next, these results suggest that universal or socioeconomically integrated pre-K
programs may be more beneficial for the early academic development of economically
disadvantaged preschoolers than targeted preschool. Of course, we must acknowledge
the potential negative implications that increased integration may have for middle-
and upper-income children who are enrolled in preschools with classmates of similar
family incomes. All else equal, our findings indicate that enrolling these children
in more economically diverse classrooms, if the counterfactual is enrollment in a
preschool with predominantly advantaged students, may result in smaller gains in
their academic skills during the pre-K year. Importantly, however, our results
suggest that as long as preschool classrooms do not exceed 25% low income and mean
income does not drop below $62,500, adding disadvantaged students to upper-income
classrooms will not negatively impact learning.

Limitations 

Limitations of the current study must be acknowledged. First, the data are
correlational. We made significant efforts to control for endogeneity bias by
including a lagged measure of children’s achievement and several child, family, and
classroom covariates. Yet results cannot be interpreted as causal. Second, only four
children per classroom were included in the study. Third, response rates for the
aggregate classroom-composition variables averaged 70%. Thus, although the 70%
response rates for compositional variables are better than those observed in many
peer-effects studies, almost one third of data at the classroom level are missing,
which may introduce substantial measurement error into our compositional variables.
However, the Multi-State and SWEEP studies (Early et al., 2005) are two of the few
studies of preschool children that contain measures of classroom-compositional
variables. Accordingly, the use of these measures marks an important contribution to
the literature. Fourth, these data are now almost two decades old and are not
nationally representative, which limits the generalizability of findings.

Fifth, GAM analytic techniques are heavily data driven; thus, it is important for
future studies to validate the threshold effects uncovered in this study. This is the
only way to verify that our results capture true differences in associations between
classroom-economic composition and early math achievement and are not due to
idiosyncrasies in the sample used in this study.

Sixth, we have been unable to disentangle whether our results are driven by classroom
characteristics or neighborhood characteristics, because they are probably related.
Our data did not include information on children’s neighborhoods, so it is impossible
to explore how many of the observed relations were attributable to
neighborhood-socioeconomic integration as opposed to integration within the
classroom. The confound between economic characteristics in neighborhood and
classroom, however, is likely less problematic in preschool than in elementary and
high school because children are not districted to preschools in their neighborhood,
and families commonly cross school-district lines and travel great distances for
early education programs (see, e.g., Gordon & Chase-Lansdale, 2001, who found the
geographic area most accurately capturing the market for center-based care to be 25
miles from children’s homes). Finally, data limitations prevented us from considering
the mechanisms that mediate links between classroom-economic composition and
individual achievement, such as aggregate academic and behavioral skills, teaching
practices, and peer modeling. Future researchers should empirically test pathways
through which classroom-economic composition relates to achievement in preschool.

Conclusion 

In conclusion, the results from this study provide new evidence of relations between
classroom-economic integration and children’s academic skills in pre-K. These
findings extend the compositional peer-effects literature by showing that the
positive links between increased economic advantage at the classroom level and
individual achievement that have been documented with older children exist in
preschool populations as well. Furthermore, the results may have implications for the
effectiveness of targeted versus universal pre-K programs in promoting children’s
school readiness. Universal programs enrolling middle- and upper-income children in
addition to economically disadvantaged children may be more effective than targeted
pre-K in promoting the early achievement of preschoolers from low-income households.
Additional research is necessary, however, to identify causal effects and to pinpoint
the mechanisms by which increased classroom advantage predicts enhanced academic
achievement. A thorough understanding of when, why, and for whom economic integration
in classrooms and schools is beneficial is necessary to inform program and policy
decisions regarding the types of early education programs that will best prepare all
children to succeed in kindergarten and beyond.

The transition to kindergarten is a critical period for children and families, with
successful transitions setting the stage for short- and long-term academic and social
success. This study explored the practices used by kindergarten teachers to help ease
children’s and families’ transition into primary school (termed “transition
practices”), and assessed their relationship to children’s social and academic
adjustment to school in a nationally representative sample of children in the United
States (N = 4,900). On average, kindergarten teachers engaged in 3 transition
practices, with outreach to parents and child or parent classroom visits most common,
and structural changes to the school schedule less frequent. Private schools and more
experienced teachers engaged in more transition practices, whereas ethnic and racial
minority, immigrant, and urban children had teachers who reported fewer practices.
Prospective, lagged regression models found that engagement in more types of
transition practices was predictive of heightened prosocial behaviors among children,
but was not associated with children’s attention or academic outcomes. Examination of
specific types of practices found that transition activities geared toward parents
were associated with children’s heightened academic skills in kindergarten. These
results provide limited evidence to support the “more is better” view of transition
practices and instead suggest that specific types of transition practices are linked
to particular aspects of children’s functioning.
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The  transition  to  school  is  an  important  time  in  the  lives  of 
children  and  families.  When  children  enter  kindergarten,  they 
often  face  a  qualitatively  different  environment  than  their  homes 
and  previous  early  education  and  care  programs,  with  increased 
demands  and  expectations  for  both  parents  and  children  (Cowan, 
Cowan,  Ablow,  Johnson,  &  Measelle,  2005;  Pianta,  Cox,  &  Snow, 
2007;  Pianta  &  Kraft-Sayre,  2003;  Ramey  &  Ramey,  2010;  Rimm- 
Kaufman  &  Pianta,  2000).  In  a  national  survey  on  the  transition  to 
kindergarten,  teachers  reported  that  almost  half  (48%)  of  children 
had  some  difficulty  adjusting  to  school,  with  16%  having  serious 
difficulties  and  32%  having  some  difficulties  (Rimm-Kaufman, 
Pianta,  &  Cox,  2000).  School  entry  marks  a  transition  not  just  for 
children but also for their parents, with shifting identities and decreasing
opportunities to engage in their child’s day-to-day activities
(Cowan  et  al.,  2005). 

The prevalence of difficulties adjusting to school is important,
given  that  successful  transitions  provide  children  with  the  founda¬ 
tion  for  later  school  success.  Entering  school  is  a  critical  period  of 
cognitive  and  social  development  for  children,  a  time  when  many 
basic  and  foundational  skills  are  taught  and  often  when  student 
records  begin  that  may  follow  the  child  for  the  duration  of  their 
school  experience  (Entwisle  &  Alexander,  1993).  During  this 
period,  “achievement  trajectories”  are  launched  with  positive  early 
school  experiences  and  adjustment  having  implications  for  success 
at  school  entry  and  beyond  (Entwisle  &  Alexander,  1993;  Snow, 
2006).  Research  supports  that  early  school  experiences  are  predic¬ 
tive  of  later  school  achievement  (Pianta  &  Walsh,  1996;  Reynolds, 
2004),  with  children  who  have  positive  experiences  more  likely  to 
report  enjoying  school  and  having  fewer  absences,  thus  potentially 
gaining  more  from  the  available  academic  experiences  that  lead  to 
better  academic  and  social  outcomes  (Ladd,  Buhs,  &  Seid,  2000; 
Ladd  &  Price,  1987;  Pianta  &  Kraft-Sayre,  2003). 

A  Developmental  Ecological  View  of  Transitions 

Given  the  importance  of  smooth  and  positive  transitions,  it  is 
essential  to  better  understand  correlates  of  successful  transitions  to 
kindergarten  and  to  delineate  whether  practices  undertaken  by 
schools to support this transition are associated with more successful
immediate and long-term functioning for children. Rimm-Kaufman
and Pianta’s (2000) developmental ecological transition
to  kindergarten  model  emphasizes  that  successful  transitions  are 
embedded  within  interacting  systems  and  rely  on  connections 
between  families,  early  childhood  settings,  and  the  elementary 
schools  children  are  entering.  The  National  Education  Goals  Panel 
of 2000, which called for all children to start school “ready to
learn,” exemplifies the tenets of the ecological and dynamic view
of the transition to school, arguing that in order to have children
ready for school, there is also a need for “ready schools,” “ready
families,” and “ready communities” (National Education Goals
Panel,  1998).  Using  this  model,  “readiness”  becomes  the  property 
of  all  the  pieces  of  the  system,  not  just  the  child  (Pianta  & 
Kraft-Sayre,  2003). 

Ramey,  Ramey,  and  Lanzi  (2006)  suggest  that  there  are  multiple 
components  that  contribute  to  successful  school  transitions,  includ¬ 
ing  that  children  and  parents  have  positive  attitudes  toward  school 
and  learning,  parents  and  key  adults  act  as  partners  in  children’s 
learning,  and  teachers  value  children  as  individuals  and  provide 
developmentally appropriate early experiences. Although the literature
suggests that successful transitions need to be the responsibility
of all parties, including parents, early education providers,
kindergarten  teachers,  schools,  and  other  service  providers  (Kagan 
&  Neuman,  1998),  the  onus  to  provide  support  generally  falls  to 
the  elementary  schools  children  are  entering. 

Description  of  School-Based  Transition  Practices 

Transition  practices  implemented  by  schools  can  help  serve  as  a 
bridge  for  children  and  families  as  they  move  into  kindergarten. 
Within  the  developmental  ecological  model,  successful  transition 
activities  foster  positive  relationships  and  should  include  connec¬ 
tions  between  children  and  families  and  schools  (Pianta  &  Kraft- 
Sayre,  2003).  Activities  specifically  targeted  at  children  may  in¬ 
clude  visiting  the  kindergarten  classroom,  meeting  their  new 
teacher,  and  learning  about  what  to  expect  from  school.  Activities 
specifically  targeted  at  families  may  include  attending  confer¬ 
ences,  registration,  and  open  houses,  or  familiarizing  themselves 
with  school  practices  and  policies  through  written  materials.  Some 
of  these  practices  are  fairly  common,  such  as  schools  sending 
information  home  to  families,  organizing  parent  orientations,  and 
having open houses, whereas other more intensive practices, such
as teachers conducting home visits or preschool visits, are much
less  common  (Pianta,  Cox,  Taylor,  &  Early,  1999;  Schulting, 
Malone,  &  Dodge,  2005). 

Past  research  has  shown  that  most  school  districts  engage  in 
some  transition  activities,  and  when  offered,  most  families  partic¬ 
ipate  and  find  them  helpful  (La  Paro,  Kraft-Sayre,  &  Pianta,  2003). 
Yet  there  is  limited  research  delineating  which  characteristics  of 
schools,  teachers,  families,  and  children  are  associated  with  the  use 
of  transition  practices.  For  example,  teachers  with  bigger  classes 
and  less  training  on  transitions  may  engage  in  fewer  transition 
practices  (Early,  Pianta,  &  Cox,  1999).  In  addition,  schools  that 
serve  more  high-poverty  and  minority  children  have  been  found  to 
engage  in  fewer  transition  practices  overall,  and  specifically  in  less 
individualized  practices  aimed  at  families  (Early  et  al.,  1999;  Love, 
Logue,  Trudeau,  &  Thayer,  1992;  Pianta  et  al.,  1999;  Schulting  et 
al.,  2005).  This  is  particularly  troubling  because  teachers  in 
schools  with  a  high  composition  of  minority  students  and  higher 
district  poverty  levels  report  increased  difficulties  in  students’ 
adjustment  to  school  compared  with  teachers  from  schools  serving 
more  advantaged  children  (Rimm-Kaufman  &  Pianta,  2000). 
Given  the  limited  knowledge  in  this  arena,  more  comprehensive 
research is needed to further identify the child, family, teacher, and
school characteristics associated with greater and lesser engagement
in transition practices.

Evidence  Relating  Transition  Practices  to  Children’s 
Adjustment  to  School 

Beyond  the  need  for  more  descriptive  information,  a  central 
question  is  whether  transition  practices  help  improve  child  expe¬ 
riences  in  school.  Further,  it  is  important  to  address  which  specific 
types  of  transition  practices  are  most  effective,  and  for  whom. 
Although  theorists  and  educators  have  increasingly  maintained  the 
importance  of  school  transition  practices,  there  is  limited  empirical 
evidence  to  support  their  effectiveness  in  easing  children’s  entry 
into  kindergarten  or  improving  their  cognitive  and  behavioral 
success  after  starting  school.  A  review  of  the  research  on  transition 
practices  for  typically  developing  children  (Eckert  et  al.,  2008) 
found  only  a  few  published  studies,  with  only  one  study  explicitly 
exploring  the  relationship  between  the  transition  practices  used  by 
kindergarten  teachers  and  children’s  academic  outcomes  (Schult¬ 
ing  et  al.,  2005).  Schulting  and  colleagues  (2005)  assessed  transi¬ 
tion  practices  addressing  information  targeted  to  parents  (e.g.,  the 
school  telephoned  or  sent  information  home,  the  teacher  visited  the 
home,  the  parents  visited  the  school,  or  there  was  a  parent  orien¬ 
tation),  practices  targeting  the  child  (e.g.,  the  child  visited  the 
classroom,  or  the  school  had  shortened  days  for  new  kindergar¬ 
teners),  and  an  open  “other”  category.  Using  data  on  a  nationally 
representative  sample  of  over  17,000  children  from  the  Early 
Childhood  Longitudinal  Study,  Kindergarten  Cohort  of  1998, 
Schulting  and  colleagues  summed  the  number  of  school-based 
transition  practices  reported  by  kindergarten  teachers  into  a  total 
score,  finding  that  a  greater  number  of  transition  practices  was 
associated  with  heightened  academic  achievement  scores  among 
children  at  the  end  of  the  school  year. 

In  a  study  of  about  400  children  transitioning  to  formal  school¬ 
ing  in  Finland,  preschool  and  elementary  school  teacher  pairs 
reported  on  the  practices  they  collaborated  on  during  children’s 
preschool  year  to  aid  in  the  transition.  They  found  that  a  greater 
number  of  transition  activities  was  associated  with  heightened 
growth  in  children’s  reading,  writing,  and  math  skills  through  the 
first  year  of  elementary  school  (Ahtola  et  al.,  2011).  These  results 
are  consistent  with  and  extend  Schulting  and  colleagues’  (2005) 
results  by  assessing  growth  in  children’s  outcomes.  However,  both 
of  these  studies  focused  only  on  academic  outcomes,  and  did  not 
consider  associations  between  transition  practices  and  other  key 
aspects  of  successful  school  transitions,  such  as  children’s  initial 
adjustment  to  school  or  longer  term  social  and  behavioral  func¬ 
tioning  through  kindergarten. 

Additional  research  has  assessed  the  association  between  tran¬ 
sition  practices  by  preschool  teachers  and  child  functioning  in 
kindergarten.  LoCasale-Crouch,  Mashburn,  Downer,  and  Pianta 
(2008)  studied  the  transition  to  school  for  a  sample  of  approxi¬ 
mately  320  children  attending  public  prekindergarten  programs  in 
the  United  States,  finding  a  positive  link  between  the  number  of 
transition  activities  used  by  preschool  teachers  (e.g.,  children, 
teachers,  and/or  parents  visiting  kindergarten,  sharing  information 
between  preschool  and  kindergarten  teachers)  and  children’s  so¬ 
cial,  self-regulation,  and  academic  skills  in  the  fall  of  their  kin¬ 
dergarten  year. 

In addition to assessing whether the breadth of transition practices
was supportive of heightened child functioning, these studies
also  assessed  whether  specific  individual  activities  were  more  or 
less  important  in  promoting  children’s  success  in  kindergarten.  For 
example,  Schulting  and  colleagues  (2005)  found  that  parents  and 
children  visiting  the  kindergarten  classroom  before  the  school  year 
started  were  the  only  individual  transition  practices  to  significantly 
predict  heightened  academic  scores  among  children  (Schulting  et 
al.,  2005).  These  results  reiterate  arguments  from  other  research 
suggesting  that  collaboration  between  schools  and  families  im¬ 
proves  children’s  school  success  in  realms  including  more  positive 
attitudes  toward  school,  better  attendance,  higher  grades,  and 
higher  graduation  rates  (Dearing,  Kreider,  &  Weiss,  2008;  Hen¬ 
derson  &  Berla,  1994;  Pomerantz,  Moorman,  &  Litwack,  2007).  In 
contrast,  the  studies  that  explored  activities  engaged  in  by  pre¬ 
school  teachers  found  that  sharing  information  with  kindergarten 
teachers  about  curriculum  or  individual  children  were  the  activities 
most  predictive  of  better  child  outcomes  (Ahtola  et  al.,  2011; 
LoCasale-Crouch  et  al.,  2008). 

A  final  central  question  in  the  literature  concerns  individual 
differences  among  children,  that  is,  asking  whether  kindergarten 
transition  practices  are  more  important  for  certain  children  than  for 
others.  For  example,  Schulting  and  colleagues  (2005)  hypothe¬ 
sized  that  transition  practices  might  be  most  protective  for  children 
facing  greater  risks  of  poor  school  adjustment  because  of  limited 
family  economic  resources.  Consistent  with  past  descriptive  re¬ 
search  on  transition  practices,  this  study  reported  that  children 
from  families  with  low  socioeconomic  status  received  the  fewest 
number  of  transition  practices,  yet  gained  the  most  from  them. 
LoCasale-Crouch  and  colleagues  (2008)  found  similar  results 
when  exploring  whether  transition  practices  moderated  the  rela¬ 
tionship  between  risk  factors  (i.e.,  maternal  education,  poverty 
level,  and  race)  and  child  outcomes.  They  found  that  transition 
practices  were  more  strongly  related  to  children’s  functioning  for 
children  facing  risk  factors  than  for  their  more  advantaged  peers. 

Research  Goals 

The  present  study  sought  to  expand  the  limited  literature  in  this 
arena by assessing the transition to kindergarten in a large, nationally
representative sample of American children born in 2001 and
followed  from  infancy  through  the  transition  to  kindergarten  in 
2006  or  2007.  The  use  of  longitudinal  data  allowed  us  to  assess 
children’s  family  characteristics,  early  educational  experiences, 
and  functioning  prior  to  kindergarten  entry,  important  information 
for  delineating  the  populations  most  and  least  likely  to  experience 
transition  practices,  and  also  essential  for  helping  to  isolate  unique 
associations  between  transition  practices  and  children’s  emotional, 
behavioral,  and  academic  functioning  after  starting  kindergarten. 
This  is  an  important  contribution  to  the  literature,  as  past  studies  on 
U.S.  samples  did  not  account  for  children’s  earlier  functioning 
before  school  entry. 

One  set  of  goals  of  the  current  study  was  descriptive.  First,  we 
sought  to  provide  an  updated  profile  of  kindergarten  transition 
practices,  exploiting  the  generalizability  of  a  nationally  represen¬ 
tative  sample  and  considering  a  broad  set  of  transition  practices 
engaged  in  by  kindergarten  teachers.  Based  on  prior  research 
(Pianta  et  al.,  1999;  Schulting  et  al.,  2005),  we  expected  that 
practices providing information to parents and organizing child
classroom visits would be most common. Second, we sought to
provide  a  rich  description  of  child,  parent,  teacher,  and  school 
characteristics  associated  with  teachers’  engagement  in  transition 
practices.  We  expected  that  less  experienced  teachers  and  those 
serving  more  disadvantaged  children  would  report  fewer  transition 
practices,  although  it  was  difficult  to  develop  further  hypotheses 
concerning  child  and  family  characteristics  from  the  limited  re¬ 
search  base. 

The  third  and  most  substantive  goal  of  this  study  was  to  extend 
evidence  on  connections  between  transition  practices  and  chil¬ 
dren’s  successful  kindergarten  functioning.  In  particular,  we 
sought to extend prior literature by (a) considering not only children’s
academic skills but also their behavioral adjustment and
social  skills  in  kindergarten;  (b)  adjusting  for  children’s  prior 
functioning  as  well  as  child,  family,  and  early  childhood  education 
(ECE)  experiences  in  order  to  isolate  associations  between  transi¬ 
tion  practices  and  children’s  growth  in  functioning;  and  (c)  assess¬ 
ing  whether  a  “more  is  better”  model  best  explains  links  between 
transition  practices  and  children’s  functioning,  or  rather  whether 
certain  types  of  transition  practices  appear  most  effective  at  sup¬ 
porting  particular  aspects  of  children’s  success  in  kindergarten.  We 
expected  that  more  practices  would  predict  better  child  outcomes, 
and  that  parent  and  child  visits  to  the  kindergarten  class  would  be 
the  most  important  individual  practices.  In  addition,  following 
tenets  of  the  developmental  ecological  transition  to  kindergarten 
model  (Rimm-Kaufman  &  Pianta,  2000)  and  evidence  that  transi¬ 
tion  practices  are  more  important  for  at-risk  children  (LoCasale- 
Crouch  et  al.,  2008;  Schulting  et  al.,  2005),  we  assessed  whether 
family  income  moderated  associations  between  transition  practices 
and  children’s  functioning.  We  expected  that  transition  practices 
helping  to  familiarize  children  and  families  with  the  norms  and 
practices  of  kindergarten  would  be  particularly  important  for  the 
functioning  of  children  from  low-income  families. 

Method 

Participants 

Data  were  drawn  from  the  Early  Childhood  Longitudinal  Study, 
Birth  Cohort  (ECLS-B),  a  longitudinal  multicomponent  study  fol¬ 
lowing  a  nationally  representative  sample  of  approximately  10,700 
children (the ECLS-B requires that all Ns be rounded to the nearest
50) born in the United States in 2001 from infancy through
kindergarten entry (Chernoff, Flanagan, McPhee, & Park, 2007).
Children  who  died  or  were  adopted  prior  to  9  months  of  age  and 
children  born  to  mothers  under  15  years  of  age  were  excluded  from 
the  sample.  The  ECLS-B  collected  five  or  six  waves  of  data  from 
primary  caregiver  interviews  (with  the  child’s  mother  in  98%  of 
cases)  and  child  assessments  when  children  were  (on  average)  10 
months,  2  years,  4  years,  5  years,  and,  for  the  approximately  25% 
of  children  not  yet  in  kindergarten  by  Age  5,  6  years  of  age.  The 
response  rate  was  74%  at  the  first  wave,  followed  by  rates  of  93%, 
91%,  92%,  and  92%  among  children  remaining  in  the  sample  at 
each  wave.  Data  for  this  study  were  drawn  from  children’s  kin¬ 
dergarten  year  (Wave  4  or  5)  and  the  year  prior  to  kindergarten 
(Wave  3  or  4).  Kindergarten  teachers  were  interviewed  in  Waves 
4  or  5  (response  rates  of  74%  and  76%),  and  ECE  providers  were 
interviewed  for  children  in  regular  center  or  home-based  ECE 
settings  in  Waves  3  or  4  (response  rates  of  70%  and  87%). 

Because we are focused on transition practices led by kindergarten
teachers and children’s functioning in kindergarten, the
analytic  sample  focused  on  children  who  remained  in  the  sample  at 
kindergarten  and  had  kindergarten  teacher  interview  data,  approx¬ 
imately  5,050  children.  Of  these  children,  about  150  were  excluded 
because  they  were  not  first-time  kindergarteners  or  did  not  attend 
a  kindergarten  classroom  at  school  entry  (e.g.,  went  directly  to  first 
grade  or  were  in  ungraded  classroom).  This  resulted  in  approxi¬ 
mately  4,900  children  in  the  study’s  analytic  sample.  Children  in 
the  sample  were  51%  White  and  51%  male.  They  averaged  68 
months  old  at  kindergarten  entry,  89%  attended  public  schools, 
and  75%  were  enrolled  in  full-day  kindergarten  classrooms.  At  the 
kindergarten  wave,  there  was  substantial  geographic  dispersion  of 
the  sample,  resulting  in  approximately  one  study  child  per  school 
(Snow  et  al.,  2009).  It  is  essential  to  note  that  the  ECLS-B 
calculated  weights  that  adjust  for  differential  sampling  and  nonre¬ 
sponse,  as  well  as  attrition  over  the  waves.  To  adjust  for  these 
factors  and  properly  estimate  standard  errors,  given  the  complex 
sampling  design,  all  analyses  included  90  replicate  weights 
(wk45t1-wk45t90) using jackknife replication methods as suggested
by the ECLS-B (Snow et al., 2009). This set of weights was
carefully  chosen  from  all  weights  created  by  the  ECLS-B  to  fit  our 
exact  analytic  sample  (children  starting  kindergarten  in  2006  or 
2007  with  parent  and  kindergarten  teacher  interview  data),  adjust¬ 
ing  for  teacher  nonresponse  as  well  as  child  and  parent  attrition 
through the waves. The use of these weights allows us to generalize
results to all children born in the United States in 2001.
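
As a rough illustration of how replicate weights enter a variance estimate, the sketch below computes a weighted mean and its jackknife standard error in Python. The function names, the toy values, and the default multiplier of 1.0 are assumptions made for illustration only; they are not taken from the ECLS-B documentation, and the analyses reported here used 90 replicate weights in Stata.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted mean of `values` using survey weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_se(values, full_weights, replicate_weights, multiplier=1.0):
    """Return (estimate, standard error) from replicate weights.

    The estimate uses the full-sample weight. The variance is the
    (scaled) sum of squared deviations of each replicate estimate
    from the full-sample estimate. `multiplier` is an assumed scaling
    factor; the appropriate value depends on the survey design.
    """
    theta_full = weighted_mean(values, full_weights)
    squared_devs = [
        (weighted_mean(values, rep) - theta_full) ** 2
        for rep in replicate_weights
    ]
    return theta_full, sqrt(multiplier * sum(squared_devs))


# Toy data: four children and two replicate weights (hypothetical values).
values = [1.0, 2.0, 3.0, 4.0]
full = [1.0, 1.0, 1.0, 1.0]
reps = [
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
]
estimate, se = jackknife_se(values, full, reps)
```

The same pattern applies to regression coefficients: the model is refit once per replicate weight, and the spread of the replicate estimates around the full-sample estimate yields the standard error.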

Prior  to  conducting  analyses,  we  explored  the  presence  of  miss¬ 
ing  data  in  the  analytic  sample  of  approximately  4,900  children. 
Item-level  missing  data  ranged  from  0%  to  20%.  Little’s  missing 
completely  at  random  (MCAR)  test  was  performed  in  SPSS  22, 
and  revealed  that  missing  values  in  the  analytic  sample  were  not 
MCAR, χ²(1064) = 2,333.72, p = .00. Further, observed variables
were related to missingness (with a general pattern suggesting
that children with lower functioning and fewer social and economic
resources were more likely than their peers to have missing
data),  supporting  the  appropriateness  of  multiple  imputation  to 
address  missing  data  (Little,  1988).  Missing  data  were  imputed 
using  multiple  imputation  by  chained  equations  in  Stata  12.1 
(Royston,  2005)  to  create  20  complete  data  sets.  The  imputation 
models  included  all  variables  described  in  the  measures  section, 
incorporating  ordinary  least  squares  (OLS),  logit,  ordered  logit, 
multinomial  logit,  and  Poisson  modeling  techniques  as  appropri¬ 
ate,  depending  upon  the  scaling  of  the  variables.  Following  impu¬ 
tation,  analyses  were  run  using  the  mi  estimate  command  in  Stata 
in  order  to  aggregate  results  and  properly  estimate  standard  errors 
across  the  imputed  data  sets. 
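
The aggregation step can be sketched with Rubin's combination rules, which pool a coefficient across imputed data sets by averaging the estimates and combining within- and between-imputation variance. The Python function below is a minimal sketch under that assumption; the function name and the three toy estimates are hypothetical, whereas the study pooled results across 20 imputed data sets in Stata.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed data sets.

    Returns (pooled estimate, pooled standard error), where the total
    variance is W + (1 + 1/m) * B: W is the mean within-imputation
    variance and B is the between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Toy example: one coefficient estimated in three imputed data sets.
pooled_b, pooled_se = pool_rubin([0.10, 0.12, 0.14], [0.05, 0.05, 0.05])
```

Because the between-imputation term is added to the average sampling variance, the pooled standard error is never smaller than it would be if the imputed values were treated as observed data.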

Measures 

Children’s  behavioral  school  adjustment.  Children’s  be¬ 
havioral  adjustment  and  functioning  in  kindergarten  were  assessed 
via  parent  and  kindergarten  teacher  reports.  Parents  reported  on 
children’s  short-term  adjustment  to  kindergarten  through  six  items 
assessing  children’s  behaviors  in  the  first  2  weeks  after  school 
entry.  Items  rated  how  often  the  child  complained  about  school, 
was  reluctant  to  go  to  school,  pretended  to  be  sick,  said  good 
things  about  school,  reported  liking  school,  and  looked  forward  to 
going to school, on a scale from 1 to 3, with items recoded so that
higher scores showed more positive adjustment to school (Chernoff
et al., 2007). Factor analysis, conducted in Stata using a
polychoric  correlation  matrix  to  account  for  the  ordinal  nature  of 
the  variables,  revealed  that  items  loaded  onto  one  factor,  and,  thus, 
items were averaged into an adjustment scale (α = .69). Teachers
reported  on  a  broader  set  of  children’s  behaviors,  with  ratings 
completed,  on  average,  2.3  months  after  the  start  of  school.  Teach¬ 
ers  reported  on  a  series  of  items  drawn  from  well-validated  mea¬ 
sures,  including  the  Preschool  and  Kindergarten  Behavior 
Scales — Second  Edition  (Merrell,  2003),  the  Social  Skills  Rating 
Scales  (Gresham,  Elliott,  &  Black,  1987),  and  the  Family  and 
Child  Experiences  Study.  Teachers  rated  the  frequency  of  the 
child’s engagement in behaviors on 5-point scales (never to very
often). We used composite measures validated in prior research
with  the  ECLS-B  (Coley,  Votruba-Drzal,  Collins,  &  Cook,  2016; 
Coley,  Votruba-Drzal,  Miller,  &  Koury,  2013),  and  reestablished 
by factor analyses in Stata using polychoric correlation matrices, to
assess  children’s  prosocial  and  attention  skills.  Prosocial  skills 
consisted  of  an  average  of  six  items  assessing  behaviors  such  as 
making friends, sharing, and comforting others (α = .82). Attention
skills were assessed with five items delineating children’s
attention, independence, task completion, and eagerness to learn
(α = .83).
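
To make the scale construction concrete, the following Python sketch reverse-codes negatively worded items on a 1 to 3 scale, averages the six parent-reported items, and computes Cronbach's alpha from a matrix of item responses. The item labels and the rule of reversing a rating as 4 minus the rating are assumptions for illustration; the source states only that items were recoded so that higher scores indicate more positive adjustment.

```python
from statistics import mean, variance

# Hypothetical labels for the negatively worded parent-report items.
NEGATIVE_ITEMS = {"complained", "reluctant", "pretended_sick"}


def recode(item, rating):
    """Reverse a 1-3 rating for negative items so higher is more positive."""
    return 4 - rating if item in NEGATIVE_ITEMS else rating


def adjustment_score(responses):
    """Average the recoded items for one child (dict of item -> rating)."""
    return mean(recode(item, r) for item, r in responses.items())


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-child item lists (already recoded)."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy child who never showed negative behaviors and always showed positive ones.
child = {
    "complained": 1, "reluctant": 1, "pretended_sick": 1,
    "good_things": 3, "liked_school": 3, "looked_forward": 3,
}
score = adjustment_score(child)

# Toy matrix in which the three items move together perfectly.
alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```

In the toy matrix the items are perfectly consistent, so alpha reaches its upper bound of 1; the reliabilities reported for the study scales are lower because real items covary imperfectly.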

Children’s  academic  adjustment.  Children’s  cognitive  skills 
were  assessed  after  kindergarten  entry  (2.3  months,  on  average, 
after  kindergarten  entry)  through  direct  assessments.  Assessments 
incorporated  items  drawn  from  well-validated,  standardized  instru¬ 
ments  such  as  the  Peabody  Picture  Vocabulary  Test — Third  Edi¬ 
tion  (L.  M.  Dunn  &  Dunn,  1997),  the  PreLAS  2000  (Duncan  & 
DeAvila, 1998), the Preschool Comprehensive Test of Phonological
& Print Processing (Lonigan, Wagner, Torgesen, & Rashotte,
2002),  and  the  Test  of  Early  Mathematics  Ability  (3rd  ed.;  Gins- 
burg  &  Baroody,  2003).  ECLS-B  statisticians  completed  extensive 
data  cleaning  and  validation  work  on  these  cognitive  measures, 
using  Item  Response  Theory  methods  to  create  composites,  which 
we  used  in  our  analyses  (see  Snow  et  al.,  2009,  for  details).  The 
early reading assessment (α = .92) consisted of 74 items that
measured early reading and language skills, including letter knowledge,
word recognition, print conventions, and phonological
awareness. The math assessment (α = .92) consisted of 58 items
focused  on  number  sense,  properties,  operations,  and  probability. 
Children’s  kindergarten  behavioral  and  academic  adjustment  mea¬ 
sures  were  used  as  dependent  variables  in  the  child  outcome 
models. 

Reports  of  children’s  prosocial  and  attention  skills  and  direct 
assessments  of  children’s  cognitive  skills  were  also  collected  at  the 
preschool wave. Following prior research (e.g., Coley et al., 2016),
preschool ratings of children’s prosocial and attention skills were
reported by the provider of their main educational environment.
For 77% of the sample, early education providers
reported children’s behaviors (prosocial skills, six items, α = .81;
attention skills, five items, α = .83). For the remaining 23% who
were not in an early education setting at Age 4, parents reported on
child behaviors (prosocial skills, six items, α = .80; attention
skills, four items [one item specifically addressing attention in
school was not reported by parents], α = .65). Children’s early
reading (α = .84) and mathematics (α = .89) skills were directly
assessed  using  the  same  measures  described  for  the  kindergarten 
wave.  Preschool  cognitive  and  behavioral  measures  were  used  as 
independent variables in the models predicting kindergarten transition
practices and as covariates (lags of the dependent variable)
to  adjust  for  prior  functioning  in  the  child  outcome  models  pre¬ 
dicting  kindergarten  adjustment. 

School  transition  practices.  In  the  teacher  questionnaire,  kin¬ 
dergarten  teachers  reported  whether  they  or  others  in  their  school 
engaged  in  seven  different  specific  practices  to  make  the  transition  to 
kindergarten less difficult for children in the study child’s class. These
included  (a)  phone/send  home  information  about  the  kindergarten 
program  to  the  parents;  (b)  invite  parents  to  the  school  for  orientation 
prior  to  the  start  of  the  school  year;  (c)  have  preschoolers  spend  some 
time  in  the  kindergarten  classroom  prior  to  school;  (d)  have  parents 
and  children  visit  kindergarten  prior  to  the  start  of  the  school  year;  (e) 
conduct  home  visits  to  the  homes  of  children  at  the  beginning  of  the 
school  year;  (f)  shorten  school  days  at  the  beginning  of  the  school  year 
for  kindergarteners;  and  (g)  stagger  school  entry  so  that  kindergarten¬ 
ers  start  the  school  year  in  smaller  groups  before  meeting  with  the  full 
class.  Each  item  was  scored  yes  or  no.  Individual  items  were  used  as 
separate  indicators  and,  following  prior  research  (LoCasale-Crouch  et 
al., 2008; Schulting et al., 2005), also were summed into a total
activities  index  variable  to  delineate  the  breadth  of  transition  activities 
engaged  in  by  the  teacher.  It  is  important  to  note  that  teachers  reported 
on  the  activities  they  engaged  in;  the  ECLS-B  did  not  gather  data  on 
whether  individual  children  and  parents  participated  in  the  offered 
activities.  The  full  summed  activity  index  was  used  as  the  dependent 
variable  in  the  first  model,  with  child,  family,  and  school  character¬ 
istics  predicting  transition  practices.  The  summed  activity  index  and 
individual  items  were  also  used  as  the  main  independent  variables  of 
interest  in  the  models  predicting  child  outcomes. 
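
The construction of the summed index from the seven yes/no items can be sketched as follows. The variable names are hypothetical shorthand for practices (a) through (g) and do not correspond to ECLS-B variable names.

```python
# Hypothetical shorthand for the seven teacher-reported practices (a)-(g).
PRACTICES = (
    "info_to_parents",     # (a) phone or send home information
    "parent_orientation",  # (b) parent orientation before school starts
    "preschooler_visit",   # (c) preschoolers spend time in the classroom
    "parent_child_visit",  # (d) parents and children visit kindergarten
    "home_visit",          # (e) home visits at the start of the year
    "shortened_days",      # (f) shortened school days at the start
    "staggered_entry",     # (g) staggered school entry
)


def transition_index(report):
    """Count how many of the seven practices a teacher reported (0 to 7)."""
    return sum(1 for practice in PRACTICES if report[practice])


# Toy teacher report: three practices endorsed.
teacher = {practice: False for practice in PRACTICES}
teacher.update(info_to_parents=True, parent_orientation=True,
               parent_child_visit=True)
n_practices = transition_index(teacher)
```

The index captures only the breadth of practices a teacher reported offering; as noted above, it carries no information about whether a given child or parent actually took part.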

Child  and  family  characteristics.  The  ECLS-B  assessed  a  rich 
set  of  child  and  family  characteristics,  which  were  used  both  as 
predictors  of  transition  practices  and  as  covariates  in  models  predict¬ 
ing  children’s  successful  transition  to  kindergarten.  These  include 
indicator  variables  of  children’s  male  gender,  whether  they  were  part 
of a multiple child birth, whether they were born with low birth
weight  (less  than  2,500  g/5.5  pounds),  and  whether  they  were  ever 
diagnosed  with  a  cognitive,  behavioral,  or  physical  disability  prior  to 
kindergarten  (all  reported  by  parents  and  coded  as  0  or  1).  Children’s 
preschool  type  was  reported  by  parents  and  teachers  and  coded  into 
mutually  exclusive  categories  of  no  nonparental  preschool  experience, 
home-based  in  child’s  home,  home-based  in  another  home,  center- 
based  in  a  school,  and  center-based  in  another  location.  Parental  race 
and  ethnicity  and  immigrant  status  were  combined  into  a  set  of 
mutually  exclusive  groups  delineating  children  of  native-born  Whites, 
native-born  Blacks,  native-born  Hispanics,  Native  American  Indians, 
and  native-born  children  of  other  races  (including  parents  of  different 
races,  and  “other”),  Hispanic  immigrants,  Asians  (94%  of  whom  were 
immigrants),  and  other  immigrants.  An  additional  indicator  desig¬ 
nated  families  whose  primary  language  was  not  English.  Mother’s  age 
at  first  childbirth  was  also  reported  as  well  as  highest  level  of  parental 
education,  which  was  coded  into  mutually  exclusive  groups  (less  than 
a  high  school  diploma,  high  school  diploma  or  GED,  some  college  or 
vocational  training,  or  bachelor’s  degree  or  higher).  Each  family’s 
community  was  coded  as  urban,  suburban,  or  rural. 

Other characteristics of parents and families that may shift over time were
assessed at the preschool wave, the wave just prior to each child’s
kindergarten entry. These included indicators of maternal employment,
maternal marital status, and maternal depression (assessed using a modified
version of the Center for Epidemiological Studies Depression Scale [Radloff,
1977; 12 items, α = .89], dichotomized to designate whether or not scores
were in the moderately to severely depressed range), all coded 0 or 1, as
well as continuous variables of total family income (in units of $10,000),
the number of nonparental adults in the household, and the number of children
in the household.

Because instability in children’s home environments may affect their
successful transition to kindergarten (Cowan et al., 2005), we also
considered each of these measures assessed at the kindergarten wave, coded to
indicate whether or not a transition had occurred (e.g., a transition from
married to single or from single to married; a loss or gain in the number of
children in the household). Because transitions in and of themselves, whether
they are commonly seen as “positive” or “negative,” all cause disequilibrium
and require adjustment in order for healthy development to occur (Cowan &
Heming, 2005; Erikson, 1950), we coded change in either direction as “1,”
with stability coded “0.” Following prior research, which has found that
aspects of instability tend to co-occur (Kull, Coley, & Lynch, 2015;
Vernon-Feagans, Garrett-Peters, Willoughby, Mills-Koonce, & the Family Life
Project Key Investigators, 2012), indicators of instability were summed into
a family instability index.
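
The coding rule above (any change between waves scored 1, stability scored 0,
then summed) can be sketched as follows. This is a minimal illustration, not
the authors' code: the measure names and the two example records are
hypothetical, and the sketch does not address how change in a continuous
measure such as income was defined, which the text does not specify.

```python
def changed(before, after):
    """Return 1 if a measure differs between waves, else 0 (direction ignored)."""
    return int(before != after)


def instability_index(preschool, kindergarten):
    """Sum the 0/1 change indicators over every measure present in both waves."""
    shared = preschool.keys() & kindergarten.keys()
    return sum(changed(preschool[name], kindergarten[name]) for name in shared)


# Hypothetical records for one family (names and values are illustrative only).
preschool_wave = {"married": 1, "employed": 0, "depressed": 0, "n_children": 2}
kindergarten_wave = {"married": 0, "employed": 0, "depressed": 0, "n_children": 3}

index = instability_index(preschool_wave, kindergarten_wave)
print(index)  # 2: marital status changed and the number of children changed
```

Because a divorce and a marriage both count as 1, the index measures the
amount of change rather than its valence, which matches the rationale given
in the text.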

Kindergarten teacher and school characteristics. Kindergarten teachers
reported on a number of school and classroom characteristics, including
whether the school was public or private, whether the study child was
attending a full-day or half-day program, the size of the class, and their
years teaching. Parents reported whether the study child had siblings in the
same school and the distance from their home to the child’s school (less than
1 mile, 1-2.5 miles, 2.6-5 miles, 5.1-10 miles, or 10 or more miles).

The  child,  family,  and  school  characteristics  were  chosen  based 
on  theory  and  research  supporting  their  relation  to  early  school 
success  (Ahtola  et  al.,  2011;  Cowan  et  al.,  2005;  LoCasale-Crouch 
et  al.,  2008;  Rimm-Kaufman  &  Pianta,  2000;  Schulting  et  al., 
2005).  Prior  to  inclusion,  bivariate  correlations  were  estimated  to 
explore  whether  each  covariate  was  associated  with  both  transition 
practices  and  the  child  outcomes.  All  covariates  included  in  the 
models  were  significantly  correlated  with  one  or  more  transition 
practices  and  one  or  more  child  outcome  measures  (see  the  online 
supplemental  materials). 

Analytic  Plan 

The first research aim of this study was to gain a better understanding of
the prevalence of kindergarten transition practices, which we address through
descriptive statistics. The second aim, assessing how child, family, and
school characteristics were associated with transition practices, was
addressed using a Poisson regression model in which all of the child, family,
and school characteristics were used as predictors of the transition
practices index score. These predictors included the four measures of
children’s functioning in the preschool wave (reading skills, math skills,
prosocial behaviors, and attention skills). To address concerns over
multicollinearity and for the sake of parsimony, the reading and math scores
were averaged into one cognitive skills composite (r = .79), and the
prosocial and attention skills measures were averaged into one behavioral
composite (r = .59) for this analysis. As a robustness check, the model was
also run with the individual functioning variables as predictors, and the
results did not change. The more parsimonious model is presented in Table 1.
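
For readers less familiar with count models, the Poisson specification
described above can be written compactly. The notation is illustrative and is
not taken from the source: Y_i denotes the transition practices index
reported by the teacher of child i, and x_{ik} denotes the k-th child,
family, or school predictor.

```latex
\log \mathbb{E}\left[ Y_i \mid \mathbf{x}_i \right]
  = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik},
\qquad
\mathrm{IRR}_k = \exp(\beta_k)
```

Under this form, each coefficient is a difference in the log of the expected
count, and its exponential is the incidence rate ratio reported next to it in
Table 1.
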

Table 1

Descriptive Statistics and Poisson Regression Model Predicting School Transition Practices From Child, Family, School, and
Teacher Characteristics

                                        Descriptive         Poisson regression model predicting
                                        statistics          transition practices sum index
Variables                               Mean or %  (SD)     IRR/B        (SE)

Transition practices sum index          3.29       (1.14)
Child and family characteristics
Preschool wave cognitive functioning
  (math and reading)                    29.15      (9.64)   1.01/.01     (.01)
Preschool wave behavioral functioning
  (attention and prosocial skills)      3.88       (.59)    1.00/.00     (.01)
Child disability                        15%                 1.05/.05**   (.02)
Male                                    51%                 .99/-.01     (.01)
Twin                                    3%                  .99/-.01     (.02)
Low birth weight                        8%                  .97/-.03+    (.02)
In kindergarten in 2007                 27%                 1.00/.00     (.03)
Age at kindergarten entry (months)      68.13      (4.39)   1.00/.00     (.00)
Months in kindergarten                  2.31       (1.40)   1.01/.01*    (.01)
Siblings in same elementary school      40%                 .98/-.02     (.02)
No ECE                                  23%                 --
Home ECE in child’s home                5%                  1.03/.03     (.04)
Home ECE in other home                  12%                 1.02/.02     (.03)
Center ECE in school                    18%                 1.01/.01     (.03)
Center ECE in other                     42%                 1.01/.01     (.02)
Native White                            51%                 --
Native Black                            12%                 .93/-.07**   (.02)
Native Hispanic                         10%                 .94/-.06*    (.03)
Native American                         1%                  .92/-.08     (.05)
Native multiple races                   3%                  .94/-.06     (.03)
Asian                                   3%                  .99/-.01     (.03)
Non-Hispanic immigrant                  5%                  .92/-.09*    (.03)
Hispanic immigrant                      15%                 .89/-.12**   (.04)
Non-English household                   18%                 .94/-.06     (.03)
Family income ($10,000s)                5.89       (4.91)   1.00/.00     (.00)
Maternal employment                     59%                 1.00/.00     (.02)
Maternal depression                     15%                 .97/-.03     (.02)
Married parents                         67%                 1.00/.00     (.02)
Additional adults in household          30%                 1.00/.00     (.02)
Children in household                   2.44       (1.09)   1.00/.00     (.01)
Family instability                      1.13       (1.15)   1.00/.00     (.01)
Maternal age at first child (years)     23.70      (5.80)   1.00/.00     (.00)
Less than high school diploma           11%                 --
High school diploma                     25%                 .95/-.05     (.03)
Some college or vocational training     32%                 .97/-.03     (.03)
Bachelor’s degree or higher             33%                 .98/-.02     (.04)
Urban                                   71%                 --
Suburban                                12%                 1.07/.06**   (.02)
Rural area                              17%                 1.04/.04+    (.02)
School and teacher characteristics
Full-day kindergarten                   75%                 1.01/.01     (.02)
Public elementary school                89%                 .94/-.06**   (.02)
Class size                              19.79      (4.29)   1.00/.00     (.00)
Years teaching kindergarten             8.83       (7.96)   1.00/.00**   (.00)
Distance <1 mile                        37%                 --
Distance 1 to 2.5 miles                 24%                 .99/-.01     (.02)
Distance 2.6 to 5 miles                 20%                 1.00/.00     (.02)
Distance 5.1 to 10 miles                13%                 1.00/.00     (.02)
Distance >10 miles                      5%                  .97/-.03     (.03)
Intercept                                                   3.68/1.30    (.18)

Note. n = 4,900. Results aggregated across 20 complete imputed data sets. Outcome is not standardized. IRR = incidence rate ratio; SE = standard error;
ECE = early childhood education.
+ p < .10. * p < .05. ** p < .01. *** p < .001.

The third research aim assessed whether transition practices were associated
with children’s functioning in kindergarten. In each set of models, five
separate OLS regression models were estimated to predict each of the five
different child outcomes in kindergarten (prosocial skills, attention skills,
child adjustment to school, reading skills, and mathematics skills). Each
outcome was standardized for these models, so that coefficients indicate
associations between a one-unit shift in the predictor and standard deviation
(SD) unit shifts in the dependent variable. The first set of models included
the index score of transition practices as the independent variable of
interest. The second set of analyses included the seven individual practice
indicator variables as the predictors of interest. Examinations of variance
inflation factors (VIF) revealed no multicollinearity concerns for concurrent
inclusion of all seven transition practices. VIF scores for the seven
transition practices ranged from 1.03 to 1.16 across the child outcome models
(mean VIF scores for all variables in these models = 1.60). The third and
fourth sets of models incorporated interactions between centered measures of
the transition practices and family income, first including the transition
sum index, and second including the seven individual indicators. Because of
high multicollinearity in models including interactions between individual
transition indicators and family income (VIF scores ranged from 1.03 to
19.61), models were run including one interaction at a time.
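
To give the VIF values above some intuition, recall the standard definition
VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the
others. The short sketch below inverts that identity for the extreme values
reported in the text; the function names are ours, and nothing here
re-estimates the authors' models.

```python
def vif_from_r2(r2):
    """Variance inflation factor implied by the R-squared of one predictor
    regressed on all other predictors."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Share of a predictor's variance explained by the other predictors."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


low = r2_from_vif(1.16)    # largest VIF among the seven practices
high = r2_from_vif(19.61)  # largest VIF once all interactions are included
print(round(low, 3), round(high, 3))  # 0.138 0.949
```

A VIF of 19.61 therefore means that roughly 95% of that term's variance was
shared with the other predictors, which is why the interactions were entered
one at a time.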

All child outcomes models included the full set of child, family, and school
covariates. Incorporation of such a rich set of covariates, many of which are
associated with increased engagement in transition practices, helps to
isolate unique associations between transition practices and children’s
successful adjustment to kindergarten. Each model also included the earlier
measure of the child outcome variable, assessed the year prior to
kindergarten, to control for the child’s prior functioning. Because
children’s adjustment to kindergarten was not, by definition, assessed in
preschool, parent report of children’s shy and worried behaviors in the year
prior to kindergarten was used as the lag in the model predicting children’s
adjustment. Inclusion of a lagged measure of child functioning adjusts for
additional unmeasured factors that have a time-invariant effect on children’s
functioning (Cain, 1975), allowing us to interpret coefficients as effects of
transition practices on changes in children’s functioning over time (Kessler
& Greenberg, 1981).
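
One way to write the lagged model described above is shown below. The symbols
are our own shorthand rather than the authors' notation: Y^K is the
standardized kindergarten outcome, Y^P is the corresponding preschool-wave
measure, T is the transition practices measure, and x_{ik} are the remaining
covariates.

```latex
Y^{K}_{i} = \beta_0 + \beta_1 T_i + \beta_2 Y^{P}_{i}
          + \sum_{k} \gamma_k x_{ik} + \varepsilon_i
```

With Y^P held constant, \beta_1 compares children who started from the same
preschool level, which is the sense in which the coefficient can be read as
an association with change in functioning.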

Results 

Descriptive  Results 

Weighted descriptive data on the sample are presented in the first column of
Table 1. Kindergarten teachers reported engaging in an average of just over
three of seven transition practices. Frequencies of individual transition
items demonstrate that transition practices reaching out to parents were the
most common, with the vast majority of kindergarten teachers reporting
phoning or sending information home to parents (90%) and having a parent
orientation (84%). Similarly, the vast majority of teachers reported having
children and parents visit the classroom (84%), but fewer had preschoolers
spend time in the classroom (37%), and very few teachers engaged in home
visits (4%) or structural practices, including staggered entry (14%) and
shortened days for kindergarteners (10%).

Child,  Family,  Teacher,  and  School  Characteristics 
Associated  With  School  Transition  Practices 

The second column of Table 1 presents results from the Poisson regression
model predicting the total transitions sum index. The incidence rate ratios,
or exponentiated coefficients, delineate the shift in estimated incidence
rates of transition practices for a one-unit shift in the predictor. Few
child and family characteristics were significantly associated with the
number of transition practices. Greater time in kindergarten was associated
with more transition practices, with a 1% increase in the number of practices
with each additional month in kindergarten, suggesting that some of the
practices occurred after the start of the school year. Children with a
disability could expect to have kindergarten teachers who reported 5% more
transition practices than the teachers of their typically developing peers.
Native-born White children also had teachers reporting greater numbers of
transition practices than their peers from African American, native-born
Hispanic, immigrant Hispanic, and other immigrant families, with differences
in incidence rates ranging from 6% to 11%. The teachers of children residing
in suburban communities also reported 7% greater numbers of transition
practices than teachers of children in urban communities. Adjusting for these
factors, family income and other measures associated with socioeconomic
status and instability were not significantly associated with school
transition practices. In terms of teacher and school characteristics,
children attending public elementary schools had teachers reporting 6% fewer
transition practices than their peers in private schools. In addition,
greater teacher experience was associated with a negligible, but
statistically significant, increase in transition practices.
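
The percentage differences quoted above follow directly from exponentiating
the Poisson coefficients. The sketch below reproduces three of them from the
rounded B values printed in Table 1; the helper names are ours, and because
the inputs are rounded to two decimals, other rows may differ from the
published IRRs in the last digit.

```python
import math


def irr(b):
    """Incidence rate ratio for a Poisson coefficient b."""
    return math.exp(b)


def percent_change(b):
    """Percent difference in the expected count per one-unit predictor shift."""
    return (math.exp(b) - 1.0) * 100.0


# Rounded coefficients (B) as printed in Table 1.
coefficients = {
    "Child disability": 0.05,
    "Hispanic immigrant": -0.12,
    "Public elementary school": -0.06,
}

for name, b in coefficients.items():
    print(f"{name}: IRR = {irr(b):.2f}, change = {percent_change(b):+.0f}%")
```

For example, B = -.12 for Hispanic immigrant families gives exp(-.12) of
about .89, that is, roughly 11% fewer reported transition practices than the
comparison group of native-born White families.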

Transition  Practices  Associated  With 
Child  Functioning 

Children’s behavioral functioning. Table 2 presents results from a series of
OLS regression models testing associations between the transition practices
index and children’s functioning after the entry to kindergarten, adjusting
for the full set of child, family, and school covariates and for children’s
functioning in preschool, assessed with the same individual measure used as
the dependent variable (exception noted in analytic plan). Models included
the kindergarten transition total sum index as the primary independent
variable, and standardized measures of children’s outcomes, such that
coefficients indicate SD unit differences in child outcomes related to a
one-unit difference in each predictor. Results show few significant links
between transition practices and children’s behavioral functioning in
kindergarten. As one exception, the number of transition practices was
associated with greater prosocial skills among children, with one additional
transition practice associated with a small, but significant, .05 SD unit
increase in children’s prosocial skills. This finding holds even after
adjusting for multiple comparisons using the Bonferroni correction
(O. J. Dunn, 1961), which sets the adjusted p-value threshold at 0.01.
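
The adjusted threshold of 0.01 is consistent with dividing a conventional
alpha of .05 by the five child outcomes tested. A minimal sketch of that
arithmetic follows; the two p values are hypothetical placeholders chosen
only to show the decision rule and are not results from the study.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Five outcomes: prosocial, attention, adjustment, reading, mathematics.
adjusted = bonferroni_alpha(0.05, 5)

# Hypothetical p values, for illustration only.
p_values = {"outcome_a": 0.004, "outcome_b": 0.03}
survives = {name: p < adjusted for name, p in p_values.items()}
print(survives)  # {'outcome_a': True, 'outcome_b': False}
```

A coefficient flagged at p < .05 but not at p < .01 would therefore be
treated as nonsignificant after correction.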

The second set of models used the seven individual indicators of transition
practices, with results presented in Table 3. These models again controlled
for the full set of child, family, and school

Table 2

OLS Models With the Transition Activities Index Predicting Child Outcomes

                                      Prosocial       Attention       Positive        Reading         Mathematics
                                      skills          skills          adjustment
Variables                             B        (SE)   B        (SE)   B        (SE)   B        (SE)   B        (SE)

Transition activities index           .05**    (.02)  .00      (.02)  .03      (.02)  .00      (.01)  .01      (.01)
Child and family characteristics
Preschool wave prosocial skills       .38***   (.03)  --              --              --              --
Preschool wave attention skills       --              .47***   (.03)  --              --              --
Preschool wave positive adjustment    --              --              -.12***  (.03)  --              --
Preschool wave reading                --              --              --              .05***   (.00)  --
Preschool wave mathematics            --              --              --              --              .06***   (.00)
Child disability                      -.18**   (.06)  -.19***  (.05)  -.09     (.06)  -.11**   (.04)  -.15***  (.04)
Male                                  -.27***  (.04)  -.27***  (.04)  -.20***  (.04)  -.02     (.03)  .05*     (.03)
Twin                                  .02      (.05)  .08      (.05)  .05      (.05)  -.01     (.04)  .02      (.04)
Low birth weight                      -.03     (.05)  -.13**   (.05)  -.04     (.05)  -.07*    (.03)  -.13***  (.03)
In kindergarten in 2007               .04      (.07)  -.08     (.06)  .04      (.07)  .10*     (.05)  .06      (.04)
Age at kindergarten entry (months)    .01+     (.01)  .03***   (.01)  .01      (.01)  .01*     (.01)  .01**    (.00)
Months in kindergarten                .01      (.02)  -.01     (.01)  -.04+    (.02)  [illeg.] (.01)  [illeg.]*** (.01)
Siblings in same school               .05      (.04)  .11**    (.04)  .05      (.04)  .00      (.03)  .06*     (.03)
No ECE                                --              --              --              --              --
Home ECE in child’s home              -.06     (.11)  -.11     (.11)  -.01     (.13)  .01      (.08)  .05      (.07)
Home ECE in other home                .02      (.08)  .05      (.08)  .01      (.08)  -.04     (.06)  .02      (.05)
Center ECE in school                  -.02     (.07)  -.1      (.06)  -.12+    (.07)  .05      (.05)  -.04     (.05)
Center ECE in other                   .08      (.06)  -.05     (.06)  -.05     (.06)  .05      (.04)  .02      (.04)
Native White                          --              --              --              --              --
Native Black                          .08      (.07)  .06      (.06)  -.06     (.07)  -.02     (.04)  -.16+    (.04)
Native Hispanic                       -.03     (.07)  -.05     (.07)  -.01     (.08)  .02      (.05)  -.08+    (.05)
Native American                       -.13     (.13)  -.01     (.15)  -.26     (.19)  -.12+    (.07)  -.25     (.14)
Native multiple races                 -.03     (.10)  -.03     (.11)  -.02     (.08)  -.03     (.07)  .01      (.06)
Asian                                 -.08     (.08)  .01      (.08)  .05      (.09)  .12+     (.06)  .06      (.06)
Non-Hispanic immigrant                -.09     (.09)  -.13     (.09)  .05      (.09)  .03      (.07)  -.04     (.06)
Hispanic immigrant                    .09      (.10)  -.01     (.09)  -.12     (.10)  .01      (.07)  -.08     (.07)
Non-English household                 .03      (.09)  .16+     (.08)  .12      (.09)  -.02     (.06)  -.07     (.06)
Family income ($10,000s)              .01      (.01)  .01      (.00)  .01*     (.01)  .00      (.00)  .00      (.00)
Maternal employment                   -.03     (.04)  -.01     (.04)  .03      (.05)  .02      (.03)  -.04     (.03)
Maternal depression                   -.15**   (.06)  -.15**   (.06)  -.13*    (.06)  .03      (.04)  .02      (.04)
Married parents                       .15**    (.06)  .08      (.05)  -.08     (.06)  .09*     (.04)  .05      (.04)
Additional adults in household        -.04     (.05)  -.03     (.05)  -.05     (.05)  -.05     (.04)  -.03     (.03)
Children in household                 -.04*    (.02)  -.03     (.02)  -.02     (.02)  -.03*    (.01)  .00      (.01)
Family instability                    -.04*    (.02)  -.06**   (.02)  -.06**   (.02)  -.04**   (.01)  -.03**   (.01)
Maternal age at first child (years)   .00      (.00)  .01*     (.00)  -.01     (.00)  .00      (.00)  .00      (.00)
Less than high school diploma         --              --              --              --              --
High school diploma                   -.02     (.07)  .02      (.07)  .05      (.08)  .16**    (.05)  .12*     (.05)
Some college or vocational training   .04      (.08)  .06      (.08)  .03      (.08)  .20***   (.06)  .17**    (.05)
Bachelor’s degree or higher           .09      (.09)  .14      (.09)  -.06     (.10)  .31***   (.06)  .27***   (.06)
Urban                                 --              --              --              --              --
Suburban                              .04      (.06)  .04      (.05)  -.07     (.07)  -.04     (.05)  -.03     (.04)
Rural area                            .06      (.06)  -.01     (.06)  .06      (.06)  -.07     (.04)  -.04     (.04)
School and teacher characteristics
Full-day kindergarten                 .04      (.05)  .02      (.04)  -.14     (.05)  .12**    (.03)  .04      (.03)
Public elementary school              -.10     (.06)  .03      (.06)  .02      (.07)  [illeg.] (.04)  -.05     (.04)
Class size                            .00      (.00)  .01      (.00)  .00      (.00)  .00      (.00)  .01*     (.00)
Years teaching kindergarten           .00      (.00)  .01**    (.00)  .00      (.00)  .00      (.00)  .00      (.00)
Distance <1 mile                      --              --              --              --              --
Distance 1 to 2.5 miles               -.1*     (.05)  -.07     (.05)  .00      (.05)  -.01     (.04)  -.02     (.03)
Distance 2.6 to 5 miles               -.07     (.05)  -.04     (.05)  -.05     (.05)  .02      (.04)  -.01     (.04)
Distance 5.1 to 10 miles              -.07     (.06)  -.07     (.06)  .03      (.06)  -.02     (.05)  -.07     (.04)
Distance >10 miles                    -.13     (.09)  -.08     (.09)  -.01     (.09)  -.07     (.06)  -.04     (.06)
Intercept                             .15      (.13)  .10      (.12)  .29      (.13)  -.45     (.09)  -.11     (.09)
R2 range                              .163-.170       .248-.251       .047-.051       .547-.598       .604-.613
R2 average                            .1652           .250            .049            .552            .610
F-score range                         11.08-12.22     24.12-24.92     2.76-3.01       77.15-80.31     91.11-96.29
F-score average                       11.46           24.47           2.87            78.23           94.27

Note. n = 4,900. Results aggregated across 20 complete imputed data sets. All outcome variables are standardized. OLS = ordinary least squares; SE =
standard error; ECE = early childhood education.
+ p < .10. * p < .05. ** p < .01. *** p < .001.


Table 3

OLS Models With Individual Transition Practices Predicting Child Outcomes

                                      Prosocial       Attention       Positive        Reading         Mathematics
                                      skills          skills          adjustment
Variables                             B        (SE)   B        (SE)   B        (SE)   B        (SE)   B        (SE)

Preschoolers visit classroom          .03      (.04)  .00      (.04)  .11*     (.04)  -.04     (.03)  -.01     (.03)
Parents and children visit class      .05      (.05)  -.06     (.05)  .01      (.06)  -.02     (.04)  .02      (.04)
Teacher home visits                   -.04     (.13)  -.11     (.12)  .10      (.09)  -.04     (.09)  .02      (.07)
Send information to parents           .13*     (.06)  .02      (.06)  .02      (.07)  .06      (.05)  .00      (.04)
Parent orientation                    -.03     (.05)  .02      (.05)  -.02     (.06)  .12**    (.04)  .12**    (.04)
Staggered entry in small groups       .02      (.06)  .02      (.06)  -.07     (.07)  -.03     (.04)  -.07+    (.04)
Shortened days                        .10      (.06)  .04      (.06)  .02      (.07)  .00      (.05)  .08*     (.04)
Intercept                             -.01     (.15)  .12      (.14)  .22      (.15)  -.55     (.11)  -.21     (.10)
R2 range                              .164-.171       .249-.253       .049-.054       .550-.561       .607-.616
R2 average                            .166            .251            .052            .555            .613
F-score range                         [illeg.]-10.79  21.52-22.31     2.67-2.87       69.13-72.42     81.71-87.80
F-score average                       10.16           21.84           2.75            70.64           85.23

Note. n = 4,900. Models included the full set of child, family, school, and teacher covariates shown in Table 2, with coefficients for covariates not shown.
Results aggregated across 20 complete imputed data sets. All outcome variables are standardized. OLS = ordinary least squares; SE = standard error.
+ p < .10. * p < .05. ** p < .01. *** p < .001.


covariates and the child functioning lag. Again, few significant results
emerged. Preschoolers spending time in the kindergarten class was associated
with a .11 SD unit increase in children’s school adjustment, and sending
information home to parents was associated with a .13 SD unit increase in
prosocial behaviors. However, when adjusting for multiple comparisons using
the Bonferroni correction (O. J. Dunn, 1961), these results no longer reached
statistical significance. Across both sets of models, school transitions were
not significantly associated with children’s attention skills in
kindergarten.

Children’s cognitive functioning. Table 2 and Table 3 also show results from
the models predicting children’s cognitive functioning after entry to
kindergarten. In the first set of models in Table 2, no significant links
emerged between the total transition practices index and children’s reading
or mathematics skills. However, consideration of individual transition
practices, with models shown in Table 3, found that provision of parent
orientations was associated with .12 SD unit increases in both children’s
reading skills and math skills. These findings are robust to the Bonferroni
correction adjusting for multiple comparisons, suggesting that to enhance
children’s academic outcomes, practices focused on parents are more important
than other individual practices, or simply more practices. In addition,
shortened days for kindergarteners was associated with a .08 SD unit increase
in math skills, yet this finding was not statistically significant when
adjusting for multiple comparisons.

In the behavioral and cognitive models, significant covariates confirm the
importance of accounting for other child, family, and school characteristics.
Across all models, preschool functioning was strongly positively associated
with children’s kindergarten functioning, whereas child disability and low
birth weight were often negatively associated. Family instability was
negatively associated with all arenas of functioning. In addition, maternal
depression showed significant negative associations with children’s
behavioral outcomes, and higher parent education was positively associated
with children’s cognitive outcomes.

Interactions between transition practices and family income. The final set of
models tested whether associations between kindergarten transition practices
and children’s functioning were moderated by family income. We first tested
interactions between family income and the total transitions sum index, and
next tested interactions between income and each of the individual transition
practices in separate models. After adjusting for multiple comparisons for
five different outcomes, using the Bonferroni correction to get an adjusted
p value of 0.01 (O. J. Dunn, 1961), no significant interactions emerged
between the transition index and family income (results presented in the
online supplemental materials). Similarly, no significant interactions
emerged between family income and individual transition practice indicator
variables, suggesting that links between transition practices and children’s
functioning were similar across the family income distribution.

Discussion 

Successful transitions to kindergarten have both short-term and long-term
implications for children and their families (Cowan et al., 2005; Entwisle &
Alexander, 1993; Ladd et al., 2000; Ladd & Price, 1987; Pianta & Kraft-Sayre,
2003). As such, it is essential to understand how schools can help to ease
this process for kindergarteners, supporting smooth and positive transitions.
This study adds to the literature by exploring the transition practices used
by kindergarten teachers; assessing the child, family, teacher, and school
characteristics associated with greater use of transition practices; and
delineating how engagement in different practices relates to children’s
social and academic adjustment in kindergarten in a nationally representative
sample of children.

Prevalence  and  Correlates  of  Transition  Practices 

Through assessment of children who entered kindergarten in 2006-2007, we
provide an updated picture of the transition practices reported by their
kindergarten teachers. On average, we found that kindergarten teachers
reported engaging in just over three of seven transition practices, with
outreach to parents (i.e., sending information home and holding parent
orientations) and children and parents visiting the classroom as the most
common practices. Structural changes to the school schedule in the beginning
of the year such as staggered entry or shortened days to ease the transition
were much less frequent. To gain a better understanding of the contexts in
which teachers were engaging in greater transition practices, we assessed how
a broad range of child, family, school, and teacher characteristics were
associated with transition practices. Results showed that teachers with more
experience, as well as those teaching in private schools, reported engaging
in significantly more transition practices than their peers.

These models also found that children in African American, Hispanic, and
immigrant families, as well as those living in urban areas, had teachers who
reported fewer transition practices than those of their peers. This is
consistent with past research that has shown that the use of transition
practices differs across schools and the demographics of the children they
serve (Early et al., 1999; Love et al., 1992; Pianta et al., 1999; Schulting
et al., 2005). In contrast to prior research considering links between risk
factors and transition practices (LoCasale-Crouch et al., 2008; Schulting et
al., 2005), we did not find that family income was associated with the
prevalence of transition practices, perhaps suggesting that bivariate
connections between income and transitions may be driven by differences
related to other family characteristics such as race and ethnicity, nativity,
and urbanicity.

Transition  Practices  and  Child  Outcomes 

This study also sought to extend the very limited prior research base linking
kindergarten transition practices to successful kindergarten functioning
among children. Expanding on past research, this study considered both
cognitive and behavioral aspects of children’s functioning in kindergarten,
reported by parents, teachers, and direct assessments. Moreover, using
longitudinal data, our models were able to adjust for a broad range of
covariates as well as for earlier measures of children’s functioning, central
techniques to help isolate unique associations between transition practices
and children’s kindergarten functioning and to limit concerns over unmeasured
heterogeneity or selection bias. Still, we caution that the work is
correlational, and unmeasured factors may have continued to bias results.

Prospective models predicting children’s functioning in kindergarten found
limited support for the hypothesis that broader use of transition practices
would support children’s adjustment to kindergarten. Specifically, results
found that use of more transition activities by kindergarten teachers was
predictive of enhanced prosocial skills among children, but was not
associated with children’s attention skills, positive adjustment, or reading
and math skills in kindergarten. These results provide limited evidence to
support the “more is better” view of transition practices in which engaging
in multiple different types of practices will provide the most support to
parents and children, thereby enhancing positive child functioning in
kindergarten.

We  also  assessed  whether  specific  types  of  transition  practices 
were  linked  to  particular  aspects  of  children’s  functioning.  Here, 
results  found  that  holding  parent  orientations  was  associated  with 
increases  in  children’s  math  and  reading  scores.  These  results 
suggest  that  sharing  expectations  with  parents  concerning  the 
academic  demands  and  practices  of  kindergarten  might  help  to 
support  children’s  academic  skill  gains.  Parent  orientations  often 
relay  expectations  regarding  daily  reading  at  home,  for  example, 
and  introduce  parents  to  concepts  such  as  homework  folders  and 
materials,  which  they  are  expected  to  review  with  their  children 
nightly.  Additional  research  is  needed  to  further  explore  these 
potential  mechanisms,  seeking  to  elucidate  the  messages  that  are 
relayed  in  parent  orientations,  and  to  delineate  how  parents  receive 
and  interpret  such  information.  Together,  these  results  support  a 
growing  base  of  research  supporting  the  importance  of  family- 
school  connections  in  promoting  school  success  among  children 
(Pomerantz  et  al.,  2007). 

Although  our  results  replicated  and  extended  prior  work  show¬ 
ing  the  primary  importance  of  parent  orientations,  results  did  not 
replicate  prior  research  that  has  found  transition  practices  to  be 
more  beneficial  for  children  at  risk  because  of  limited  family 
economic  resources  (LoCasale-Crouch  et  al.,  2008;  Schulting  et 
al.,  2005).  In  our  models,  children  from  lower  income  families 
neither  had  teachers  reporting  fewer  transition  practices,  nor 
showed  greater  or  lesser  benefits  from  such  practices  than  their 
more  economically  advantaged  peers.  One  possible  explanation  for 
this  inconsistency  is  simply  that  our  analytic  models  more  fully 
adjusted  for  a  variety  of  other  correlated  characteristics  of  families 
and  schools. 

Limitations  and  Directions  for  Future  Research 

This  study  adds  to  the  literature  by  providing  an  updated  portrait 
of  the  transition  practices  used  by  kindergarten  teachers  and  their 
associations  with  both  social  and  academic  functioning  among 
new  kindergarteners.  Yet  our  understanding  of  the  use  and  utility 
of  transition  practices  remains  somewhat  limited.  This  and  other 
past  studies  asked  teachers  about  the  general  practices  they  en¬ 
gaged  in  for  children  in  their  class,  but  did  not  specify  which 
children  and  parents  actually  participated  in  the  activities.  In 
addition,  teachers  reported  only  on  whether  or  not  they  engaged  in 
each  transition  practice,  providing  no  information  on  the  quality, 
frequency,  or  intensity  of  the  activities.  For  example,  when  teach¬ 
ers  reported  that  they  did  home  visits,  it  is  not  clear  whether  home 
visits  were  conducted  for  all  children  in  their  class  or  just  specific 
children,  or  the  content  and  nature  of  the  visit.  Past  research  has 
also  suggested  that  there  are  many  barriers  teachers  face  to  engag¬ 
ing  in  transition  practices,  including  lack  of  time  and  pay  during 
the  summer  months  and  getting  class  lists  too  late  (Early,  Pianta, 
Taylor,  &  Cox,  2001).  Future  research  should  seek  to  more  richly 
describe  the  reach  and  intensity  of  transition  practices  as  well  as 
optimal  structures  for  supporting  teachers  in  their  full  use. 

In  addition,  this  study  looked  primarily  at  horizontal  transition 
practices  that  connect  children  and  families  to  the  elementary 
schools  they  are  entering,  and  did  not  address  the  early  education 
settings  they  were  coming  from  or  the  transition  practices  that  may 
have  been  led  by  children’s  early  education  settings.  Following  the 
developmental  ecological  transition  to  kindergarten  model  (Rimm- 
Kaufman  &  Pianta,  2000),  it  is  important  for  more  research  to 
address  connections  made  between  the  early  childhood  settings 
children  are  leaving  and  the  elementary  schools  they  are 
entering — so-called  vertical  transition  practices.  Updated  research 
is  needed  to  explore  the  barriers  teachers  face  in  engaging  in  both 
horizontal  and  vertical  transition  practices  and  their  relationships 
to  both  short  and  long-term  functioning  among  children. 

Finally,  it  is  essential  to  reiterate  some  limitations  of  the  data 
and  statistical  methods  used  in  this  research.  Although  we  assessed 
a  large  sample  of  children,  followed  prospectively,  and  incorpo¬ 
rated  sampling  weights  which  were  designed  to  allow  generalizability 
to  all  children  born  in  the  United  States  in  2001,  it  is 
possible  that  weights  did  not  fully  adjust  for  all  nonresponse  and 
attrition  bias.  Moreover,  although  we  adjusted  for  a  broad  range  of 
child,  family,  teacher,  and  school  covariates,  our  models  remain 
correlational,  and  cannot  support  causal  conclusions.  In  addition,  it 
is  important  to  note  that  the  size  of  the  significant  effects  unearthed 
in  this  research  was  consistently  small. 

Implications  for  Policy  and  Practice 

Despite  the  aforementioned  limitations,  this  study  provides  in¬ 
sight  into  the  types  of  transition  practices  that  may  help  support 
children’s  positive  adjustment  to  school.  In  an  environment  in 
which  school  resources  are  limited,  and  districts,  schools,  and 
teachers  must  make  choices  about  how  to  spend  their  financial  and 
human  resources  effectively,  educators  are  pushed  to  make  strate¬ 
gic  decisions  about  the  practices  they  engage  in  to  best  support 
children’s  successful  transitions  to  school.  This  study  extends 
results  from  the  very  limited  prior  research  on  transition  practices 
engaged  in  by  kindergarten  teachers  in  the  United  States  (Ahtola  et 
al.,  2011;  LoCasale-Crouch  et  al.,  2008;  Schulting  et  al.,  2005)  to 
suggest  that  a  broader  use  of  diverse  transition  practices  (more 
practices)  may  support  children’s  prosocial  behaviors  with  peers. 
Our  findings  also  suggest  that  outreach  specifically  to  parents 
through  parent  orientations  may  be  a  key  transition  practice  for 
supporting  children’s  academic  success  in  both  early  reading  and 
mathematics.  Moreover,  the  relationships  between  transition 
practices  and  children’s  social  and  academic  success  in  kindergar¬ 
ten  were  not  moderated  by  income  in  a  nationally  representative 
sample,  suggesting  that  transition  practices  are  beneficial  for  chil¬ 
dren  across  the  income  spectrum. 

Theory  and  research  using  a  social-information  processing  framework  indicate  that  reward-focused 
(proactive)  aggression  has  different  social  consequences  than  defense-focused  (reactive)  aggression. 
Students  use  norms  that  identify  expected  and  socially  approved  behaviors  as  guides  to  their  own 
actions.  Differences  in  social-cognitive  processing  characteristics  and  social  status  linked  to  each 
type  of  aggression  may  increase  the  relevance  of  some  normative  sources  relative  to  others.  This 
study  fills  a  gap  in  the  literature  by  examining  the  contributions  of  personal  beliefs,  classroom 
beliefs,  and  classroom  rates  of  aggression  to  future  proactive  and  reactive  aggression.  During  fall 
and  spring,  we  observed  students’  aggression  on  school  playgrounds  using  a  random  subsample 
(n  =  254)  of  consented  students  from  35  classrooms  (Grades  3-6).  We  calculated  classroom  rates  of 
proactive  and  reactive  aggression  from  fall  observations.  Classroom  means  for  beliefs  endorsing 
retaliation  were  calculated  from  surveys  of  536  students.  Results  of  multilevel  analyses  revealed,  as 
hypothesized,  that  personal  beliefs  predicted  high  rates  of  students’  proactive  aggression,  but  not 
reactive  aggression.  Classroom  beliefs  predicted  high  rates  of  students’  reactive  but  not  proactive 
aggression.  Students  in  classrooms  with  high  rates  of  fall  proactive  aggression  showed  low  spring 
rates  of  both  types  of  aggression.  In  contrast,  students  in  classrooms  with  high  rates  of  fall  reactive 
aggression  displayed  high  spring  rates  of  proactive  and  reactive  aggression.  The  latter  pattern  may 
represent  classrooms  in  which  students  continue  to  struggle  against  status  inequities.  The  discussion 
examines  how  inequities  may  impact  intervention  efforts. 

Keywords:  proactive  aggression,  reactive  aggression,  retaliatory  beliefs,  classroom  norms,  peer  influence 


Aggression  is  a  chronic  problem  in  our  nation’s  schools, 
leading  many  educators  to  adopt  aggression  reduction  pro¬ 
grams.  These  programs  often  aim  to  promote  normative  beliefs 
and  expectations  among  students  that  aggression  is  unaccept¬ 
able  or  ineffective.  Effectiveness  varies,  however  (Ansary, 
Elias,  Greene,  &  Green,  2015;  Ttofi  &  Farrington,  2011),  and 
efficacy  of  intervention  practices  may  depend  on  the  functions 
served  by  specific  normative  beliefs  and  aggressive  behaviors. 
Functional  distinctions  between  proactive  aggression  (also 
called  instrumental  or  reward-focused)  and  reactive  aggression 
(also  called  retaliatory  or  defensive)  are  associated  with  differ¬ 
ent  patterns  of  social  goals,  outcome  expectations,  and  self- 
regulatory  abilities  (see  reviews  and  meta-analyses;  Card  & 
Little,  2006;  Hubbard,  McAuliffe,  Morrow,  &  Romano,  2010; 
Polman,  de  Castro,  Koops,  van  Boxtel,  &  Merk,  2007).  Thus, 
normative  beliefs  at  the  personal  and  classroom  levels  and 
typical  patterns  of  classroom  behavior  may  differentially  pre¬ 
dict  changes  in  the  two  types  of  aggression. 

Despite  a  rich  history  of  theoretical  and  empirical  work  on 
normative  influences  (e.g.,  Bandura,  Caprara,  Barbaranelli, 
Pastorelli,  &  Regalia,  2001;  Huesmann,  1988;  Henry  et  al.,  2013) 
and  on  functional  distinctions  in  aggression  (Crick  &  Dodge, 
1994;  Dodge  &  Coie,  1987;  Marsee  et  al.,  2014),  few 
studies  have  examined  the  specific  contributions  of  aggressive 
norms  to  proactive  and  reactive  aggression.  Key  variations  in 
the  social-cognitive  processes  and  the  social  context  associated 
with  each  aggression  type  suggest  that  the  relative  influence  of 
social  norms  will  differ  in  systematic  ways.  This  study  em¬ 
ployed  Dodge’s  social  information  processing  model  (SIP;  Fon¬ 
taine  &  Dodge,  2006)  as  a  framework  for  examining  the  con¬ 
tributions  of  specific  social  norms  to  proactive  and  reactive 
aggression. 

Theory  and  Research  on  Proactive  and 
Reactive  Aggression 

SIP  links  the  selection  and  enactment  of  social  behaviors  to 
six  elements  in  a  nearly  instantaneous  decision-making  process 
(Fontaine  &  Dodge,  2006).  Each  step  in  the  process  can  be 
performed  with  varying  degrees  of  adequacy,  depending  on 
situational  and  personal  characteristics.  In  Step  1,  a  student 
notices  social  cues  (e.g.,  a  derisive  tone  of  voice,  the  presence 
of  bystanders),  which  form  the  basis  for  social  inferences  in 
Step  2  (e.g.,  “This  person  wants  to  embarrass  me  publicly.”).  In 
Step  3,  a  student  identifies  personal  goals  (e.g.,  reduce  negative 
emotions,  reclaim  lost  status).  In  Step  4,  a  student  generates 
alternative  ways  to  achieve  a  goal  (e.g.,  run  away,  hit  the  kid, 
wait  and  retaliate  anonymously).  Students  evaluate  alternatives 
in  Step  5  based  on  beliefs  that  reflect  sociomoral  values,  self- 
efficacy  (e.g.,  “I’m  not  a  good  fighter.”),  and  expectations  of 
success  in  a  specific  situation  (e.g.,  “I’m  alone,  but  that  per¬ 
son’s  friends  are  here”).  Finally,  in  Step  6,  students  perform  the 
selected  action.  These  steps  are  performed  in  emotional  con¬ 
texts  that  vary  in  valence  and  intensity. 

A  high  level  of  emotional  arousal  is  likely  to  interfere  with 
decision-making  processes.  This  leads  to  actions  that  may  be  based 
on  inadequate  attention  to  social  cues,  snap  judgments  about 
others’  intentions,  and  failure  to  evaluate  the  adequacy  of  possible 
responses.  Highly  emotional,  impulsive  responses  are  typical  of 
reactive  aggression  (Card  &  Little,  2006;  Hubbard  et  al.,  2010; 
Polman  et  al.,  2007).  They  may  be  provoked  by  unjust  or  otherwise 
aversive  situations.  Moreover,  reactive  aggression  is  associated 
with  biases  toward  encoding  and  interpreting  social  cues  as  hostile, 
and  responses  that  are  retaliatory. 

Proactive  aggression  refers  to  intentional  acts  of  aggression 
that  are  aimed  at  achieving  a  desired  goal  and  are  contingent  on 
evaluation  of  consequences  (Fontaine  &  Dodge,  2006).  Proac¬ 
tive  aggression  is  uniquely  related  to  bullying:  the  repeated 
targeting  of  a  person  of  lesser  power  (Fossati  et  al.,  2009). 
Although  proactive  aggression  may  be  accompanied  by  anger, 
emotion  expression  is  likely  to  be  regulated  in  ways  that  enable 
strategic  decision-making  and  goal  acquisition  (Hubbard  et  al., 
2002).  Retaliation,  for  example,  may  be  delayed  in  order  to 
assure  a  successful  outcome  and  an  audience  for  the  antag¬ 
onist’s  humiliation  (Dodge,  1991).  Manipulation  of  peers  (Lit¬ 
tle,  Henrich,  Jones,  &  Hawley,  2003)  may  also  disguise  the 
avenger’s  identity  or  incite  allies  to  retaliate  (Frey,  Pearson,  & 
Cohen,  2015;  Garandeau  &  Cillessen,  2006;  Xie,  Swift,  Cairns, 
&  Cairns,  2002). 

In  general,  proactive  aggression  appears  to  be  associated  with 
successful  goal  achievement  more  than  reactive  aggression.  Reac¬ 
tive  aggression  is  related  to  peer  rejection  (Evans,  Fite,  Hendrick¬ 
son,  Rubens,  &  Mages,  2015)  and  victimization  (Salmivalli  & 
Helteenvuori,  2007),  while  proactive  aggression  is  associated  with 
a  dominant  position  in  the  social  hierarchy  (Pellegrini,  Bartini,  & 
Brooks,  1999;  Pellegrini  et  al.,  2011;  Sijtsema,  Veenstra,  Lindenberg, 
&  Salmivalli,  2009;  cf.  Polman,  de  Castro,  Thomaes,  &  van 
Aken,  2009).  The  status  that  often  accompanies  proactive  aggres¬ 
sion  may  be  a  key  element  of  success  as  students  navigate,  influ¬ 
ence  and  even  co-opt  classroom  norms  for  behavior. 


Normative  Beliefs  and  Behavior 

Normative  influences  include  internal,  personal  beliefs,  and 
those  that  are  shared  within  groups  such  as  classrooms.  Personal 
beliefs  become  visible  to  classmates  via  gossip,  advice,  and  ex¬ 
hortations.  When  particular  values  and  beliefs  are  widely  shared, 
they  become  part  of  the  classroom  culture  (Hawley  &  Williford, 
2015). 

Normative  beliefs  are  injunctive  norms — what  most  people 
think  should  happen  (Henry,  2008).  They  describe  beliefs  about 
moral  and  conventional  obligations  and  constraints.  Normative 
beliefs  provide  support  for  self-regulatory  efforts  as  people  try  to 
act  in  ways  congruent  with  personal  and  cultural  definitions  of  a 
good  person  (Bandura  et  al.,  2001).  Normative  behavior,  in 
contrast,  is  a  descriptive  norm,  based  on  patterns  of  actions  and 
reactions  that  typify  social  groups.  Classmates  can  observe  behav¬ 
iors,  such  as  aggressive  acts  that  are  common  in  specific  situations, 
and  the  attendant  rewards  and  punishments. 

Common  classroom  patterns  associated  with  aggression  may 
acquire  a  veneer  of  perceived  legitimacy  as  in  “Everybody  does 
it.”  (Ang,  Ong,  Lim,  &  Lim,  2010;  Chang,  2004),  even  if  they  are 
not  sanctioned  by  shared  classroom  beliefs.  The  responses  of 
victims  and  bystanders  to  aggression  contribute  to  classroom 
norms  and  outcome  expectations  by  determining  whether  an  ag¬ 
gressive  act  is  viewed  as  efficacious,  rewarding,  or  costly  (Pel¬ 
legrini,  2008).  In  order  to  better  understand  normative  influences 
on  aggression,  we  need  to  consider  descriptive  norms  such  as 
classroom  rates  of  aggression,  as  well  as  injunctive  norms  such  as 
beliefs  regarding  aggression  at  the  personal  and  classroom  level. 

Past  Research  on  Normative  Contributions 
to  Aggression 

Previous  research  identified  links  between  personal  beliefs 
endorsing  aggression  and  aggressive  behavior  in  concurrent 
analyses  (Kikas,  Peets,  Tropp,  &  Hinn,  2009;  Werner  &  Nixon, 
2005)  and  longitudinal  analyses  (Henry  et  al.,  2000;  Huesmann 
&  Guerra,  1997;  Werner  &  Hill,  2010).  Physical  and  relational 
aggression  generally  show  similar  relationships  to  beliefs.  Fur¬ 
ther,  beliefs  that  justify  bullying  are  linked  to  high  levels  of 
bullying  behavior  (Perren,  Gutzwiller-Helfenfinger,  Malti,  & 
Hymel,  2012;  Sentse,  Veenstra,  Kiuru,  &  Salmivalli,  2015). 

Normative  beliefs  that  are  widely  shared  specify  who  is  rewarded 
by  aggression,  how,  and  under  what  circumstances.  Past  research 
shows  inconsistent  links  between  classroom  beliefs  (usually  calcu¬ 
lated  as  the  group  mean  of  personal  beliefs)  and  later  aggressive 
behavior  (Henry  et  al.,  2000;  Sentse  et  al.,  2015;  Werner  &  Hill, 
2010). 
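
As an illustration of the aggregation mentioned above (classroom beliefs taken as the group mean of personal beliefs), the following minimal sketch computes one mean per classroom. The record layout, the function name, and the sample scores are hypothetical and are not drawn from any of the cited studies.

```python
from collections import defaultdict
from statistics import mean


def classroom_belief_means(records):
    """Return {classroom_id: mean personal belief score}.

    `records` is an iterable of (classroom_id, belief_score) pairs.
    The layout and the scores are illustrative assumptions only.
    """
    by_class = defaultdict(list)
    for classroom_id, score in records:
        by_class[classroom_id].append(score)
    # One classroom-level score per group: the mean of its members' scores.
    return {cid: mean(scores) for cid, scores in by_class.items()}


# Hypothetical survey responses (higher = stronger endorsement of retaliation).
sample = [("A", 2.0), ("A", 3.0), ("B", 1.0), ("B", 1.5), ("B", 2.0)]
means = classroom_belief_means(sample)
```

A group mean of this kind treats every respondent equally, so a classroom-level score can be shaped by a few students with extreme views when classrooms are small.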

The  level  of  aggressive  actions  within  schools,  classrooms,  and 
peer  groups  has  also  been  shown  to  contribute  to  later  student  aggres¬ 
sion  (Mercer,  McMillen,  &  DeRosier,  2009;  Sentse  et  al.,  2015; 
Thomas,  Bierman,  &  Powers,  2011).  Most  multilevel  studies  have  not 
included  both  normative  beliefs  and  behaviors  as  predictors  of  later 
aggression.  Those  that  have  are  inconclusive  regarding  the  relative 
magnitude  or  possible  causal  ordering  of  each  type  of  normative 
influence  (e.g.,  Henry  et  al.,  2000;  Huesmann  &  Guerra,  1997;  Sentse 
et  al.,  2015). 

Norms  Are  Pertinent  to  Response  Evaluation 

According  to  the  SIP  model,  normative  influences  derived  from 
personal  and  classroom  beliefs  and  classroom  rates  of  aggression 
are  most  likely  to  contribute  to  social  decision-making  during  Step  5: 
response  evaluation.  Relatively  stable  mental  structures  such  as 
beliefs  about  sociomoral  acceptability,  self-identity,  and  expecta¬ 
tions  of  positive  and  negative  outcomes  form  the  basis  for  response 
evaluation  (Fontaine  &  Dodge,  2006).  Given  the  processing  dif¬ 
ferences  involved  in  proactive  and  reactive  aggression,  the  rele¬ 
vance  of  norms  in  decision-making  is  likely  to  vary  by  the  type  of 
aggression  and  the  type  and  source  of  the  normative  influence. 

Normative  Beliefs  Endorsing  Retaliation 

This  study  focused  specifically  on  personal  and  classroom  beliefs 
regarding  retaliation,  which  vary  according  to  culture  and  geographic 
region  (Cohen,  2001).  It  might  be  assumed  that  beliefs  endorsing 
retaliation  are  pertinent  only  for  reactive  (i.e.,  retaliatory)  aggression, 
but  two  lines  of  inquiry  suggest  otherwise.  First,  personal  beliefs 
endorsing  retaliation  appear  to  disengage  moral  constructs  against 
harming  others  (Bandura  et  al.,  2001)  with  either  proactive  or  reactive 
aggression.  Endorsement  of  retaliation  and  justifications  based  on 
prior  offense  are  associated  with  high  levels  of  bullying,  for  example 
(Bradshaw,  O’Brennan,  &  Sawyer,  2008;  Perren  et  al.,  2012).  Further, 
group  norms  regarding  retaliation  are  likely  to  affect  the  costs  and 
rewards  associated  with  aggression,  as  well  as  being  reflected  in  the 
decision-making  process.  Thus,  beliefs  endorsing  retaliation  may  be 
relevant  for  both  proactive  and  reactive  aggression.  But  as  detailed 
next,  contribution  may  vary  with  the  source  (personal  or  classroom 
beliefs),  and  a  student’s  online  processing  adequacy  when  aggression 
becomes  a  possibility. 

Do  Norms  Differentially  Predict  Students’  Proactive 
and  Reactive  Aggression? 

Despite  increasing  interest  in  ecological  influences  on  aggres¬ 
sion  (see  review,  Pellegrini,  2008),  the  normative  influences  on 
students’  proactive  and  reactive  aggression  have  been  virtually 
ignored.  To  our  knowledge,  only  one  study  on  young  adults  has 
examined  the  relationships  of  reactive  and  proactive  aggression  to 
personal  beliefs  (Bailey  &  Ostrov,  2008).  Since  it  is  unclear 
whether  research  on  young  adults  can  be  applied  to  elementary 
classrooms,  more  investigation  is  warranted  on  normative  influ¬ 
ences,  particularly  in  the  school  setting. 

Elementary  schoolchildren  spend  much  of  the  school  day  with 
classmates,  who  are  likely  to  be  a  potent  source  of  normative 
influence.  Whether  their  influence  makes  similar  contributions  to 
proactive  and  reactive  aggression  is  a  question  of  both  theoretical 
and  practical  importance.  There  is  a  particular  need  for  longitudi¬ 
nal  data,  since  normative  influences  may  significantly  moderate 
intervention  efficacy.  Our  study  responds  to  this  need  by  investi¬ 
gating  the  complex  relationships  between  specific  normative  in¬ 
fluences  at  the  personal-  and  classroom-levels  and  subsequent 
aggressive  actions.  Consistent  with  behavioral  ecological  analyses 
(Pellegrini,  2008),  we  assume  that  classroom  beliefs  and  classroom 
rates  of  proactive  and  reactive  aggression  will  create  costs  and 
benefits  that  are  differentially  reflected  in  students’  rates  of  pro¬ 
active  and  reactive  aggression. 


Current  Study 

The  data  for  the  current  study  were  collected  as  part  of  an 
evaluation  of  an  antibullying  program.  We  conducted  fall  and 
spring  playground  observations  and  student  surveys  of  beliefs 
endorsing  retaliation  for  students  in  Grades  3-6.  These  grades 
precede  declines  in  the  effectiveness  of  bullying  prevention  pro¬ 
grams  (Yeager,  Fong,  Lee,  &  Espelage,  2015),  but  coincide  with 
developmental  changes  that  are  eventually  likely  to  be  relevant  to 
program  effectiveness.  For  example,  approval  for  retaliation  in¬ 
creases  during  these  grades  (Frey,  Hirschstein,  Edstrom,  &  Snell, 
2009;  Huesmann  &  Guerra,  1997),  as  does  sensitivity  to  social 
cues  (Blakemore  &  Mills,  2014). 

To  our  knowledge,  this  is  the  first  multilevel  longitudinal  study 
to  distinguish  between  the  contributions  that  normative  beliefs  and 
behaviors  make  to  each  type  of  aggression.  Our  hypotheses  predict 
the  separate  contributions  of  personal  beliefs,  classroom  beliefs, 
classroom  rates  of  proactive  aggression,  and  classroom  rates  of 
reactive  aggression  to  students’  proactive  and  reactive  aggression. 
We  next  provide  the  rationale  for  each  hypothesis.  Hypotheses  are 
also  summarized  in  Figure  1. 

Hypotheses  for  Proactive  Aggression  Reflect  Goal 
Focus  and  Peer  Dominance 

Personal  Beliefs  Considered  During 
Response  Evaluation 

The  goal  of  proactive  aggression  is  to  obtain  rewards  with  minimal 
cost.  This  strategic  quality  suggests  that  response  evaluation,  Step  5, 
will  play  an  important  role  in  the  social  decision-making  process 
(Fontaine  &  Dodge,  2006).  Stable  mental  structures  such  as  personal 
beliefs  will  inform  response  evaluation  and  selection.  For  example, 
aggressive  self-efficacy,  a  characteristic  of  proactive  aggressors  (Hub¬ 
bard  et  al.,  2010),  predicts  retaliation  (Erdley  &  Asher,  1996).  To 
someone  with  a  goal  of  being  socially  dominant,  provocations  that  go 
unpunished  may  imperil  goal  achievement  and  thus  justify  aggression 
(Boyd,  2014).  Low  impulsivity  also  means  that  aggressors  who  are 
predominantly  proactive  can  maintain  a  revenge  goal  over  time  and 
strategize  accordingly.  With  these  considerations  in  mind,  we  pre¬ 
dicted  that  personal  beliefs  endorsing  retaliation  would  contribute  to 
high  rates  of  students’  proactive  aggression  in  the  spring  after  ac¬ 
counting  for  prior  aggression  rates. 

Leadership  May  Negate  Contribution 
of  Classroom  Beliefs 

As  a  well-regulated  goal-directed  behavior,  proactive  aggression  may 
be  relatively  unaffected  by  in-the-moment  peer  pressure  to  retaliate 
(Thomas  &  McGloin,  2013).  Successful  proactive  aggressors,  however, 
will  be  cognizant  of  threats  to  their  dominance  goals  if  they  lose  peer 
support  for  their  aggression  (Veenstra,  Lindenberg,  Munniksma,  &  Dijkstra, 
2010).  Importantly,  proactive  aggressors  try  to  shape  group  norms 
to  their  own  specifications  (Allen,  Porter,  &  McFarland,  2006).  Manip¬ 
ulating  classmates’  interpretation  of  events  (Little  et  al.,  2003)  by  appeal¬ 
ing  to  retaliation  norms  may  be  an  efficacious  strategy.  In  speaking  with 
allies,  for  example,  proactive  aggressors  may  recast  their  own  aggression 
as  a  justifiable  response  to  victim  retaliation  (Boyd,  2014).  If  proactive 
aggressors  have  a  disproportionate  influence  on  classroom  beliefs,  then 
those  beliefs  are  unlikely  to  contribute  significantly  to  later  proactive 
aggression  beyond  the  contribution  of  personal  beliefs  (Sentse  et  al., 
2015).  Thus,  we  did  not  predict  a  significant  relationship  between  class¬ 
room  beliefs  endorsing  retaliation  in  the  fall  and  students’  proactive 
aggression  in  the  spring. 

Figure  1.  Model  illustrates  the  hypothesized  contributions  of  normative  beliefs  and  classroom  rates  of 
aggression  to  subsequent  personal  rates  of  proactive  and  reactive  aggression. 
[Figure  not  reproduced.  Column  headings:  Norm  Type;  Contribution  Type;  Possible  Social  Psychological 
Mechanisms.  Mechanisms  listed:  goal-focus  supports  response  evaluation  process;  impulsivity  inhibits 
response  evaluation  process;  normative  leaders  advance  personal  beliefs;  group  expression  fuels  arousal, 
resignation;  hierarchy  enables  nonaggressive  influence  efforts;  rejection  and  intimidation  foster  passivity; 
peer  rejection,  resistance  facilitate  justifications;  hierarchy  resistance,  antagonisms  become  chronic.] 

Classroom  Proactive  Aggression  and  the 
Advantage  of  Power 

Two  lines  of  reasoning  offer  divergent  hypotheses.  On  one  hand,  the 
elevated  status  enjoyed  by  proactive  aggressors  (Pellegrini  et  al.,  2011)  is 
a  compelling  reward.  Past  success  derived  from  aggressive  strategies  may 
encourage  continued  high  rates  of  proactive  aggression  (Sentse  et  al., 
2015),  particularly  if  competition  for  dominance  continues  (Garandeau  & 
Cillessen,  2006).  On  the  other  hand,  the  establishment  of  a  favorable 
dominance  hierarchy  at  the  beginning  of  the  year  may  enable  some 
proactive  aggressors  to  desist  (Faris  &  Felmlee,  2011).  Mindful  of  peer 
support  (Veenstra  et  al.,  2010),  they  may  prefer  to  dispense  selective 
kindness  and  foster  strategic  alliances  (Hawley,  2003;  Pellegrini  et  al., 
2011;  Roseth  et  al.,  2011).  Such  a  shift  in  direction  is  easier  to  accomplish 
if  proactive  aggression  confers  protection  against  later  aggression  by 
classmates  (Frey,  Newman,  &  Onyewuenyi,  2014).  These  considerations 
suggest  an  alternative  hypothesis — that  high  classroom  rates  of  proactive 
aggression  in  the  fall  will  contribute  to  low  rates  of  students’  proactive 
aggression  in  the  spring. 

Classroom  Reactive  Aggression  May  Enable 
Opportunistic  Justifications 

Reactive  aggressors  may  be  socially  inept,  enabling  proactive  aggres¬ 
sors  to  manipulate  peer  opinion  (Garandeau  &  Cillessen,  2006;  Little  et 
al.,  2003)  and  the  social  rejection  of  reactive  aggressors  (Evans  et  al., 
2015).  Peers’  lack  of  sympathy  for  reactive  aggressors  enables  proactive 
aggressors  to  target  them  without  fear  of  losing  peer  support  (Veenstra  et 
al.,  2010).  Thus,  high  levels  of  classroom  reactive  aggression  may  actu¬ 
ally  help  proactive  aggressors  bully  with  impunity.  Another  consideration 
is  that  resistance  to  domination  by  proactive  aggressors  may  take  the  form 
of  reactive  aggression.  If  reactive  aggression  rates  are  indicative  of  wide¬ 
spread  resistance,  proactive  aggressors  may  deem  it  necessary  to  fre¬ 
quently  display  dominance  aggressively.  Both  lines  of  reasoning  suggest 
that  high  classroom  rates  of  reactive  aggression  in  the  fall  will  contribute 
to  high  rates  of  students’  proactive  aggression  in  the  spring. 

Hypotheses  for  Reactive  Aggression  Reflect 
Impulsivity  and  Peer  Rejection 

Impulsivity  Inhibits  Evaluation  Based 
on  Personal  Beliefs 

Although  reactive  aggressors  may  hold  the  same  beliefs  about 
retaliation  as  their  peers,  they  are  less  likely  to  review  such  mental 
structures  when  deciding  whether  or  not  to  retaliate.  Response 
evaluation  (Step  5)  is  often  cursory  or  omitted  entirely  as  a  result 
of  emotional  dysregulation.  Indeed,  Arsenio  and  Gold  (2006)  find 
that  reactively  aggressive  youth  display  normal  moral  develop¬ 
ment,  but  impulsivity  restricts  their  ability  to  regulate  their  behav¬ 
ior  on  the  basis  of  normative  standards.  Thus,  we  predicted  that 
personal  beliefs  will  not  contribute  to  students’  reactive  aggression 
in  the  spring. 

Expressed  Classroom  Beliefs  May  Fuel 
Arousal  and  Resignation 

In  contrast,  impulsive  youth  appear  to  be  particularly  influenced 
by  social  norms  (Thomas  &  McGloin,  2013),  perhaps  due  to  unmet 
needs  for  social  acceptance  (Evans  et  al.,  2015).  These  youth, 
however,  do  not  necessarily  evaluate  responses  systematically  in 
light  of  classroom  norms.  Classroom  beliefs  and  expectations  may 
be  revealed  in  the  exhortations  of  bystanders  (e.g.,  when  they  are 
eager  to  see  a  fight).  At  such  times,  adolescents  report  emotional 
flooding  and  confusion  (Farrell,  Henry,  Schoeny,  Bettencourt,  & 
Tolan,  2010),  lessening  constraints  that  might  be  provided  by 
personal  norms  or  even  fear  of  the  consequences.  Furthermore, 
targets  of  aggression  may  feel  that  the  classroom  culture  will 
eventually  force  retaliation,  even  if  they  do  not  view  it  as  a 
desirable response or an effective deterrent (Farrell et al., 2010; de
Castro, Verhulp, & Runions, 2012). Thus, we predicted that classroom
beliefs endorsing retaliation will contribute to increased rates
of  students’  reactive  aggression  later  in  the  spring,  even  after 
accounting  for  prior  aggression  and  personal  beliefs. 

Classroom  Proactive  Aggression  and  Peer  Rejection 
May  Foster  Passivity 

Domination  by  others  is  likely  to  stimulate  high  rates  of  aggres¬ 
sive  reactivity,  at  least  in  the  short-term.  Over  the  school  year, 
however,  successful  intimidation  and  domination  may  result  in 
fear  and  despair  on  the  part  of  frequent  targets.  Resistance  may 
appear  futile  (Craig,  Pepler,  &  Blais,  2007)  even  without  careful 
evaluation  of  costs  and  benefits  in  any  one  instance.  Over  time, 
resignation  and  fear  may  reduce  retaliation  against  more  effective 
aggressors (Brendgen et al., 2013; Camodeca & Goossens, 2005).
Therefore,  we  expected  that  high  classroom  rates  of  proactive 
aggression  would  predict  low  rates  of  students’  reactive  aggression 
in  the  spring. 

Classroom  Reactive  Aggression  May  Become  Chronic 

Impulsivity  makes  it  difficult  for  reactively  aggressive  children 
to evaluate and avoid rough play and arguments (Frey et al., 2014),
activities that frequently escalate to aggression. High fall rates of
reactive  aggression  may  be  indicative  of  classrooms  with  high 
conflict  and  mutual  antagonisms.  High  rates  of  fall  reactive  ag¬ 
gression  may  also  indicate  classrooms  in  which  resistance  to 
domination  has  become  part  of  the  classroom  climate  (Hawley  & 
Williford,  2015).  In  consideration  of  both  possibilities,  we  pre¬ 
dicted  that  high  classroom  rates  of  reactive  aggression  in  the  fall 
would  contribute  to  high  rates  of  students’  reactive  aggression  in 
the  spring. 

Method 

Participants 

This  study  examined  two  waves  of  data  collected  during  a 
randomized controlled trial that examined the effects of an antibullying
program in two midsized cities in the United States
Pacific Northwest (Frey et al., 2009). Because the intervention was
expected to change the relationship between variables, we selected
only 35 classrooms in the three control schools for the study.
School selection was based on educators’ willingness to be randomly
assigned to either an intervention or delayed control group.
The  percentage  of  children  receiving  free  and  reduced-price  lunch 
at  the  schools  ranged  from  21%  to  60%.  Adherence  to  ethical 
guidelines  was  strictly  enforced.  Active  parental  consent  was  ob¬ 
tained  for  64%  of  all  students,  and  written  child  assent  was 
obtained  from  students  in  Grades  4-6.  A  total  of  536  students 
were  surveyed  regarding  endorsement  of  retaliation  in  the  fall  and 
spring.  Fall  surveys  from  this  sample  were  used  to  calculate  means 
for  classroom  beliefs. 

Ten  children  (usually  five  girls  and  five  boys)  in  each  classroom 
were  randomly  selected  for  playground  observation.  Of  an  initial 
total  of  284  in  the  observation  sample,  30  children  left  their 
schools  during  the  study  (10.6%)  and  5  (1.8%)  were  excluded  for 
incomplete  observations  (less  than  35  min  at  both  Time  1  and 
Time  2).  T-tests  comparing  fall  rates  of  proactive  and  reactive 
aggression  in  the  retained  sample  with  those  lost  to  attrition 
indicated  no  significant  differences.  The  final  observed  sample 
(n  =  254;  122  girls)  was  10.2%  African  American,  9.4%  Asian 
American,  68.9%  European  American,  9.4%  Hispanic  American, 
and  2.0%  Indigenous  American.  Students  who  spoke  English  as  a 
second  language  comprised  11.0%  of  the  sample,  although  ob¬ 
served  conversations  were  exclusively  in  English.  Students  re¬ 
mained  in  the  same  classroom  every  day. 

Data  Collection  Procedures 

Endorsement  of  retaliation.  The  Endorsement  of  Retaliation 
Survey  was  used  to  measure  personal  and  classroom  beliefs.  It  is 
a group-administered four-point Likert scale (1 = Do not agree,
4 = Agree a lot). An initial array of 18 items included ones that
examined  endorsement  of  indirect  and  relational  aggression  (Crick 
& Grotpeter, 1995), as well as direct physical and verbal aggression
(Huesmann & Guerra, 1997). After pretesting with 126 students,
eight items with strong factor loadings (four each for direct
and indirect retaliation) were administered to 309 students in
Grades 3-6. This yielded factor loadings ranging from .50 to .60,
except for one item with a loading of .23. That item was dropped,
forming a scale with four items specifying direct retaliation and
three specifying indirect retaliation. Examples of final items include
“It’s okay to hit someone who hits you first,” “It’s okay to
say something mean to someone who’s pushing you around,” “It’s
okay to say something mean about someone who says something
mean about you,” and “It’s okay to stop talking to someone to get
even.” The scale showed moderate 10-week stability, r(289) =
.36, p < .01. Internal consistency in the current study was .87.

Observation  overview.  Youth  were  observed  on  playgrounds 
from  late  September  through  December,  and  again  from  late 
March  through  June.  Morning  recess  was  15  min  long,  and  after¬ 
noon  recess  was  typically  25  min  long,  enabling  five  focal  child 
observation  periods.  Typically,  each  student  was  observed  once  a 
week  for  5  min  over  10  to  12  weeks,  with  the  order  of  observation 
determined  randomly  each  week.  This  schedule  provided  enough 
repetitions (M = 10, range 7-12) to reduce censoring effects
(Stoolmiller,  Eddy,  &  Reid,  2000)  and  staggered  sampling  times 
over  the  broadest  range  of  conditions  (e.g.,  preholiday,  rainy 
weather).  When  conversations  required  extended  proximity,  ob¬ 
servers  reduced  reactivity  by  periodically  shifting  positions  and 
continuously  keeping  the  children  in  sight  with  direct  or  peripheral 
vision.  Children  were  minimally  reactive,  commenting  that  the 
observers  “don’t  do  anything.” 

The  coding  manual  was  used  in  conjunction  with  custom- 
programmed  handheld  devices.  Coding  in  real  time  (Sackett, 
1977),  observers  opened  multiple  screens  in  a  response-contingent 
order,  thereby  reducing  operator  error.  Screen  1  identified  the 
actor.  Screen  2  identified  aggressive,  nonaggressive,  or  bystander 
behaviors.  Indicating  aggression  on  Screen  2  automatically  led  to 
Screen  3  for  coding  aggressive  function. 

Observer  training.  Of  15  paid  observers  who  were  blind  to 
the  specific  purpose  of  the  study,  14  successfully  completed 
training.  Training  Phase  1  (200  hr)  covered  ethical  guidelines, 
operational  definitions,  borderline  decision-making,  error  cor¬ 
rection,  and  data  collection  protocols.  Trainees  coded  video¬ 
taped  playground  behaviors  and  received  immediate  feedback. 
In  order  to  advance  to  in  vivo  coding,  each  coder  had  to  agree 
with  a  master  coder  with  an  overall  minimum  mean  kappa  of  .70 
on  videotapes.  Training  Phase  2  (40-50  hr  of  playground 
coding)  allowed  time  for  children  to  habituate  to  the  presence  of 
observers.  For  at  least  eight  hours,  each  observer  coded  simul¬ 
taneously  with  a  master  coder,  receiving  feedback,  and  review¬ 
ing  discrepancies.  Prior  to  spring  data  collection,  coders  under¬ 
went  20  hr  of  booster  training  before  advancing  to  Training 
Phase  2.  In  fall  and  spring,  data  collection  started  after  kappas 
averaged  at  least  .70.  To  prevent  decay,  master  coders  per¬ 
formed  agreement  checks  (15%  of  sessions,  n  =  868)  on  coder 
accuracy  throughout  data  collection. 

Observer  accuracy.  Two  coders  were  said  to  agree  on  a 
behavior  when  they  indicated  the  same  code  within  one  second  of 
each  other.  Agreement  of  qualifier  codes  for  aggressive  function 
was  event-  rather  than  time-dependent.  Consistent  with  the  exact¬ 
ing  training,  percentage  agreements  were  above  89%  and  overall 
kappa  was  excellent  (k  =  .80).  Kappas  were  also  excellent  when 
computed  for  separate  behaviors  (reported  below)  despite  typically 
low  levels  found  when  infrequent  events  diverge  from  0.5  (Xu  & 
Lorber,  2014). 
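
The gap between percentage agreement and kappa can be illustrated with a minimal sketch of Cohen's kappa for two coders. The category labels and the 20-interval example below are hypothetical and are not the study's data; the sketch also assumes codes have already been aligned, so it does not model the one-second matching window described above.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two equal-length sequences of categorical codes."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(codes_a)
    # Observed proportion of intervals on which the two coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: aggression is rare (4 of 20 intervals for each coder).
coder_a = ["agg"] * 4 + ["non"] * 16
coder_b = ["agg"] * 3 + ["non"] + ["agg"] + ["non"] * 15

# The coders agree on 18 of 20 intervals (90%), yet kappa is only about 0.69,
# because chance agreement is high when one category dominates.
print(round(cohen_kappa(coder_a, coder_b), 2))
```

With a rare behavior, most agreement comes from both coders recording "no aggression," which is why kappa penalizes the same percentage agreement more heavily than it would for a behavior occurring about half the time.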

Aggression  (k  =  .76)  was  coded  when  a  focal  child  hurt  a 
peer with physical acts, threats, exclusion, or demeaning comments
and gestures. Aggression included both face-to-face encounters
(e.g., punching; saying, “You can’t sit here.”) and
actions  out  of  the  target’s  direct  awareness  (e.g.,  eye-rolling 
behind  a  target’s  back,  derogatory  gossip,  plotting  to  exclude  or 
otherwise  harm  a  person).  Both  conversation  content  and  non¬ 
verbal  cues  (e.g.,  significant  looks  in  the  direction  of  the  target) 
were  used  to  identify  derogatory  gossip.  Verbal  statements 
could  be  coded  as  aggressive  based  on  derogatory  content  even 
if  the  speaker  was  smiling  or  laughing.  Similarly,  coders  dis¬ 
tinguished  between  aggression  and  rough  play.  The  latter  was 
accompanied  by  mutual  felt  smiles  and  laughter.  Bouts  of  rough 
play  sometimes  devolved  into  aggression,  accompanied  by 
shifts  from  positive  to  negative  expressions  on  the  part  of  at 
least  one  participant. 

Aggressive  function  (k  =  .80)  was  coded  as  proactive  if  initiated 
without  any  provocation  apparent  during  the  5-min  coding  period. 
Aggression  was  coded  as  reactive  if  it  appeared  to  be  an  impulsive 
retaliatory  response,  occurred  during  a  disagreement  (e.g.,  dispute 
about  whether  a  player  was  safe),  or  immediately  followed  other 
types  of  provocation  such  as  cutting  in  line. 

Computation  of  Classroom  Beliefs  and  Classroom 
Rates  of  Aggression 

Although  the  study  predicts  the  rates  of  students’  aggression  in 
the  spring  with  a  minimum  of  35  min  of  observation  time  in  both 
spring  and  fall,  scores  from  the  entire  classroom  were  used  to 
calculate  means  for  classroom  beliefs.  Mean  classroom  beliefs 
were  based  on  a  minimum  of  seven  cases  (M  =  16.21  cases,  total 
N  =  584). 

In  multilevel  analyses,  tests  of  fixed  parameter  coefficients  are 
relatively  robust  when  there  is  a  small  average  number  of  cases  per 
classroom  (e.g.,  4  to  5).  Power  to  detect  interactions  of  classroom- 
and  student-level  variables,  however,  may  be  limited  (Snijders, 
2005).  Therefore,  in  order  to  include  the  greatest  number  of 
classmates  when  calculating  mean  rates  of  classroom  proactive 
and  reactive  aggression,  we  included  students  with  at  least  25  min 
of  observed  fall  behavior  (n  =  283).  It  would  typically  require  5  or 
6  weeks  to  collect  25  min  of  observation.  Given  the  3-week 
observer  habituation  period  prior  to  data  collection,  8  weeks  of 
recess  participation  (two  thirds  of  the  total)  was  probably  the 
minimum  time  these  students  were  interacting  with  classmates.  All 
classrooms  yielded  a  minimum  of  seven  observation  cases  (M  = 
8.72),  and  fall  mean  rates  were  based  on  a  total  of  15,077  min  of 
observation  time. 
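
As a rough illustration of how hourly rates and classroom means of this kind can be derived, the sketch below converts event counts and observation minutes into acts per hour and averages them by classroom, keeping only students with at least 25 min of observation. The record layout, function names, and sample numbers are hypothetical. Averaging student-level rates, rather than pooling events and minutes across the classroom, is an assumption; the text does not specify which was used.

```python
def hourly_rate(event_count, minutes_observed):
    """Convert a count of coded events into a rate per hour of observation."""
    if minutes_observed <= 0:
        raise ValueError("minutes_observed must be positive")
    return event_count / minutes_observed * 60.0


def classroom_mean_rates(records, min_minutes=25.0):
    """Average student hourly rates within each classroom.

    records: iterable of (classroom_id, event_count, minutes_observed).
    Students observed for fewer than min_minutes are left out.
    """
    rates_by_class = {}
    for classroom_id, event_count, minutes in records:
        if minutes >= min_minutes:
            rate = hourly_rate(event_count, minutes)
            rates_by_class.setdefault(classroom_id, []).append(rate)
    return {
        classroom_id: sum(rates) / len(rates)
        for classroom_id, rates in rates_by_class.items()
    }


# Hypothetical records: (classroom, aggressive acts, minutes observed).
records = [
    ("A", 1, 50.0),
    ("A", 2, 40.0),
    ("A", 5, 20.0),  # excluded: fewer than 25 min observed
    ("B", 0, 30.0),
]
print(classroom_mean_rates(records))
```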

Data  Analyses 

Individuals  were  nested  within  classroom,  potentially  violat¬ 
ing  the  assumption  of  independence  in  group  variance  esti¬ 
mates.  Initial  unconditional  models  validated  the  need  for  mul¬ 
tilevel  modeling,  and  subsequent  analyses  evaluated  competing 
nested  models  of  the  relationships  between  fall  multilevel  pre¬ 
dictors  and  spring  rates  of  individual  proactive  and  reactive 
aggression.  The  student-level  covariates  were  gender  and  fall 
rates  of  proactive  and  reactive  aggression.  Two-grade  class¬ 
rooms  accounted  for  21.1%  of  the  class  sample.  Therefore, 
grade  level  was  treated  as  a  two-level  classroom  covariate 
(Grades  3-4  and  Grades  5-6).  Preliminary  analyses  found  no 
significant  interactions  of  gender  or  grade  level,  confirming 
their  appropriateness  as  covariates. 
Evaluation of three competing models. Predictor variables
were  grand-mean  centered.  Models  were  tested  with  mixed  models 
(SPSS  19)  using  full-information  maximum-likelihood  estimation. 
Models  included  random  effects  of  the  intercept,  and  competing 
models  were  evaluated  using  Akaike’s  information  criterion  (AIC) 
and  likelihood  ratio  tests. 
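
Grand-mean centering, as mentioned above, subtracts the overall sample mean from each predictor value so that the model intercept describes a student at the sample average of the predictors. The following is a minimal sketch with hypothetical belief scores; it is not the SPSS procedure used in the study.

```python
def grand_mean_center(values):
    """Subtract the overall (grand) mean from every value."""
    if not values:
        raise ValueError("values must not be empty")
    grand_mean = sum(values) / len(values)
    return [v - grand_mean for v in values]


# Hypothetical personal-belief scores pooled across all classrooms.
beliefs = [0.0, 0.5, 1.0, 2.5]
centered = grand_mean_center(beliefs)
print(centered)  # values now sum to zero; their spread is unchanged
```

Because the same constant is subtracted from every student regardless of classroom, centering of this kind changes the interpretation of the intercept but not the slopes.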

Results 

Descriptive  Statistics 

Mean  total  observation  times  were  49.4  min  per  student  in  the 
fall  and  49.5  min  in  the  spring.  Correlations  between  hourly  rates 
of  students’  proactive  aggression,  hourly  rates  of  students’  reactive 
aggression,  and  students’  personal  beliefs  endorsing  retaliation  for 
the  observed  sample  are  shown  in  Table  1.  Those  for  the  fall  are 
below  the  diagonal,  those  for  spring  above  it.  Autocorrelations 
between  times  are  shown  on  the  diagonal.  Correlations  between 
students’  proactive  and  reactive  aggression  were  similar  to  those 
found  in  other  observational  studies  (Card  &  Little,  2006;  Polman 
et  al.,  2007).  Personal  beliefs  endorsing  retaliation  showed  the 
greatest  stability  between  fall  and  spring.  Stability  (r  =  .67)  and 
mean  endorsement  of  beliefs  (fall  M  =  0.79,  SD  =  0.77;  spring 
M = 0.82, SD = 0.81) in the entire sample (N = 536) were
virtually identical to those in the observed subsample, and also
showed no mean change from fall to spring (ts < 1.42, ns). Mean
rates for students’ reactive aggression were significantly higher than
those for students’ proactive aggression in both fall, t(253) = 6.30,
p < .01, η² = .22, and spring, t(253) = 4.05, p < .01, η² = .16,
despite a spring increase in students’ proactive, t(253) = 3.02, p <
.01, η² = .03, but not reactive, aggression, t(253) = 0.08, ns.

Unconditional  Models 

Unconditional  random-effects  models  were  fitted  to  decompose 
the  variance  in  observed  proactive  and  reactive  aggression  into  that 
due  to  differences  across  classrooms,  and  differences  across  indi¬ 
viduals  nested  within  classrooms.  In  both  cases,  most  of  the 
predictable  variance  was  found  at  the  individual  level.  Classroom 


Table 1

Correlation Matrix and Mean Values for Student Aggression
and Personal Beliefs

                 Correlations                     Spring values
Measure          PA       RA       Beliefs        M              SD
PA               .21      .44      .17            [illegible]    2.10
RA               .34      .32      .06            1.83a          2.72
Beliefs          .07      .19      [illegible]    .81            .79
Fall values
  M              .75a     1.85a    .76
  SD             1.29     2.94     .77

Note. N = 254. Cells below the diagonal provide fall intercorrelations.
Cells above the diagonal provide spring intercorrelations. Diagonal cells
(underlined) provide fall-spring autocorrelations. Correlations significant
at p < .01 are in bold. Correlations significant at p < .05 are in italics.
PA = rate of proactive aggression; RA = rate of reactive aggression;
Beliefs = personal beliefs.
a Rate per hour.


accounted  for  17.2%  of  the  variance  in  students’  proactive  aggres¬ 
sion  (intraclass  correlation  coefficient  [ICC]  =  .172,  Wald  Z  = 
2.47,  p  <  .05)  and  10.1%  of  the  variance  in  students’  reactive 
aggression  (ICC  =  .101,  Wald  Z  =  1.80,  p  <  .10).  Since  Wald  Z 
is  a  relatively  insensitive  test  (Hauck  &  Donner,  1977),  the  re¬ 
maining  analyses  nested  subjects  in  classrooms  for  both  variables. 
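
The intraclass correlation reported above is the share of total variance that lies between classrooms. A minimal sketch of that calculation follows; the variance components are hypothetical values chosen only to reproduce an ICC of .172, since the unconditional-model variance estimates are not reported in the text.

```python
def intraclass_correlation(between_variance, within_variance):
    """ICC = between-classroom variance / (between + within variance)."""
    if between_variance < 0 or within_variance < 0:
        raise ValueError("variance components must be non-negative")
    total = between_variance + within_variance
    if total == 0:
        raise ValueError("total variance must be positive")
    return between_variance / total


# Hypothetical variance components (not taken from the study).
icc = intraclass_correlation(between_variance=0.86, within_variance=4.14)
print(round(icc, 3))
```

An ICC of .172 means that roughly one sixth of the variation in spring proactive aggression is attributable to the classroom a student belongs to, which is the dependence that multilevel models are designed to handle.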

Multilevel  Links  Between  Fall  Norms  and 
Students’  Aggression 

Students’  proactive  aggression  in  spring.  The  first  model 
evaluated  the  contribution  of  personal  beliefs  to  hourly  rates  of 
students’  proactive  aggression  in  the  spring  after  accounting  for 
the  contributions  of  gender,  grade  level,  and  fall  rates  of  students’ 
proactive  and  reactive  aggression.  Consistent  with  our  hypothesis, 
personal  beliefs  predicted  high  rates  of  students’  proactive  aggres¬ 
sion  (Table  2). 

The  addition  of  mean  classroom  beliefs  in  Model  2  did  not 
improve  model  fit.  Model  3  added  classroom  rates  of  proactive  and 
reactive aggression. The AIC and -2 log likelihood values (difference
= 6.40, df = 2, p < .05) indicated that Model 3 yielded a
better  fit.  As  hypothesized,  high  rates  of  classroom  reactive  ag¬ 
gression  in  the  fall  predicted  high  levels  of  students’  proactive 
aggression  in  the  spring.  In  contrast,  high  classroom  rates  of 
proactive  aggression  in  the  fall  predicted  low  rates  of  students’ 
proactive  aggression  in  the  spring.  Even  with  the  addition  of 
classroom  aggression,  personal  beliefs  remained  significant  pre¬ 
dictors  of  students’  proactive  aggression.  Random  effects  for  class¬ 
room  remained  a  significant  source  of  variance,  Wald  Z  =  1.97, 
p  <  .05,  after  accounting  for  beliefs  and  aggression. 

Students’ reactive aggression in spring. As expected, personal
beliefs were not related to students’ reactive aggression in
the  spring  after  accounting  for  the  covariates  (Table  3).  As  pre¬ 
dicted,  Model  2  indicated  that  classroom  beliefs  predicted  later 
high  rates  of  students’  reactive  aggression. 

The addition of classroom rates of proactive and reactive aggression
improved fit over Model 2, as indicated by the AIC
and -2 log likelihood values (difference = 8.96, df = 2, p < .05).
Model  3  indicated  that  classroom  rates  of  reactive  aggression  in 
the  fall  had  a  positive  relationship  to  students’  reactive  aggression 
in  the  spring.  Classroom  rates  of  proactive  aggression,  on  the 
contrary,  had  a  negative  relationship  to  students’  reactive  aggres¬ 
sion  in  the  spring.  The  contribution  of  classroom  beliefs  declined 
to  marginal  significance  with  the  addition  of  classroom  rates  of 
aggression.  The  classroom  variance  remaining  after  accounting  for 
beliefs  and  aggression  was  also  not  significant,  Wald  Z  =  1.97,  ns. 
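
The model comparisons above can be checked with a short worked example. The sketch below takes the -2 log likelihood and AIC values from Tables 2 and 3, computes the likelihood ratio test for the two parameters added in Model 3, and recovers the number of estimated parameters implied by AIC = -2 log likelihood + 2k. It relies on the fact that the chi-square distribution with 2 degrees of freedom has the closed-form upper-tail probability exp(-x/2). The function names are illustrative, and the table values are rounded, so the differences may not match the text to the second decimal.

```python
import math


def lrt_two_df(deviance_reduced, deviance_full):
    """Likelihood ratio test for nested models differing by 2 parameters.

    Returns (deviance difference, p value). For 2 degrees of freedom the
    chi-square upper-tail probability is exactly exp(-x / 2).
    """
    difference = deviance_reduced - deviance_full
    if difference < 0:
        raise ValueError("the fuller model should not fit worse")
    return difference, math.exp(-difference / 2.0)


def implied_parameter_count(aic, deviance):
    """Number of estimated parameters k implied by AIC = deviance + 2k."""
    return round((aic - deviance) / 2.0)


# Proactive aggression, Model 2 vs. Model 3 (Table 2).
diff_pa, p_pa = lrt_two_df(1026.4, 1020.0)
# Reactive aggression, Model 2 vs. Model 3 (Table 3).
diff_ra, p_ra = lrt_two_df(1166.9, 1158.0)

print(round(diff_pa, 1), round(p_pa, 3))
print(round(diff_ra, 1), round(p_ra, 3))
print(implied_parameter_count(1042.5, 1026.5), implied_parameter_count(1042.0, 1020.0))
```

Both p values fall below .05, in line with the reported tests, and the implied parameter counts (8 for Model 1, 11 for Model 3) match six fixed effects plus two variance terms in Model 1 and three added classroom predictors by Model 3.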

Generalizability  of  results.  Analyses  indicated  that  results 
were  not  limited  by  gender  or  grade.  Further,  there  were  no 
significant  interactions  between  students’  aggression  rates  in  the 
fall  and  class-level  predictor  variables. 

Discussion

This  study  adds  a  cautionary  note  to  investigations  of  normative 
influences  on  aggression,  since  generalizing  across  proactive  and 
reactive  aggression  may  prove  misleading.  Our  results  demonstrate 
that  normative  beliefs  contribute  to  students’  proactive  and  reac¬ 
tive  aggression  in  varied  ways  that  are  both  theoretically  and 
practically meaningful. Personal beliefs endorsing retaliation predicted

Table 2

Contributions of Fall Aggressive Beliefs and Behavior to Hourly Rates of Students’ Proactive Aggression in Spring

                         Model 1:                        Model 2:                          Model 3:
                         Personal beliefs                Personal and classroom beliefs    Beliefs and classroom behavior
Variable                 Estimate   SE     t ratio       Estimate   SE     t ratio         Estimate   SE     t ratio
Intercept                1.45       .27    5.31***       1.40       .31    4.47***         1.14       .30    3.75***
Girls                    -.42       .24    -1.76†        -.42       .24    -1.76           -.40       .24    -1.68
Grade                    .03        .50    .06           .03        .50    .06             .57        .51    1.12
Fall PA rates            .23        .11    2.17*         .23        .11    2.17*           .29        .11    2.63**
Fall RA rates            .17        .04    3.78***       .17        .04    [illegible]     .14        .05    3.02**
Personal beliefs         .51        .17    2.91***       .51        .17    [missing]       .53        .17    3.08**
Classroom beliefs                                        -.91       .79    -.37            -.76       .75    -1.01
Classroom rates of PA                                                                      -1.03      .42    -2.43*
Classroom rates of RA                                                                      .40        .17    2.34*
Random class effects     .64        .27                  .63        .27                    .44        .23
-2 log likelihood        1,026.5                         1,026.4                           1,020.0
AIC                      1,042.5                         1,044.4                           1,042.0

Note. N = 254. PA = rate of proactive aggression; RA = rate of reactive aggression; Belief = personal belief; AIC = Akaike information criterion.
† p < .10. * p < .05. ** p < .01. *** p < .001.

students’ proactive aggression, whereas classroom beliefs
endorsing  retaliation  predicted  students’  reactive  aggression.  Spe¬ 
cifically,  students  whose  personal  beliefs  strongly  endorsed  retal¬ 
iation  committed  one  additional  act  of  proactive  aggression  per 
hour  in  the  spring,  compared  to  students  who  did  not  endorse 
retaliation.  And  students  in  classrooms  with  strong  support  for 
retaliation  committed  one  additional  act  of  reactive  aggression 
every  20  min  compared  to  students  in  classrooms  with  less  sup¬ 
port. 

In  contrast,  the  contributions  of  descriptive  classroom  norms 
were  similar  for  proactive  and  reactive  aggression.  High  classroom 
rates  of  proactive  aggression  in  the  fall  appeared  to  have  an 
inhibitory  influence.  Students  in  classrooms  with  higher  rates  of 
proactive  aggression  reduced  their  proactive  aggression  in  the 
spring  by  one  act  per  hour,  and  reactive  aggression  by  one  act 
every  40  min,  compared  to  students  in  classrooms  with  lower  rates. 
High  classroom  rates  of  reactive  aggression  made  more  modest 
contributions, predicting relative increases in students’ proactive
and reactive aggression by slightly less than one act every two
hours. 

Functional  Relationships  Between  Beliefs 
and  Aggression 

The  varied  contributions  of  each  type  of  normative  belief  are 
consistent  with  functional  analyses  provided  by  SIP  theory  and 
research.  While  it  may  seem  counterintuitive  that  personal  beliefs 
endorsing  retaliation  would  predict  students’  proactive  aggression 
but  not  reactive  aggression,  the  finding  is  in  line  with  investiga¬ 
tions  indicating  that  preemptive  processing  during  reactive  aggres¬ 
sion  interferes  with  response  evaluation  (Step  5  in  the  SIP  model). 
This  diminishes  the  role  of  self-regulatory  processes  (Arsenio  & 
Gold,  2006;  Fontaine  &  Dodge,  2006).  Peer  actions  stemming 
from  normative  beliefs  (Gasser  &  Malti,  2012)  may  have  more 
impact  in  the  heat  of  the  moment  because  their  influence  does  not 
rely  on  response  evaluation.  Peer  encouragement  of  retaliation 


Table 3

Contributions of Fall Aggressive Beliefs and Behavior to Hourly Rates of Students’ Reactive Aggression in Spring

                         Model 1:                        Model 2:                          Model 3:
                         Personal beliefs                Personal and classroom beliefs    Beliefs and classroom behavior
Variable                 Estimate   SE     t ratio       Estimate   SE     t ratio         Estimate   SE     t ratio
Intercept                1.48       .35    4.28***       1.88       .37    5.10***         1.54       .36    4.28***
Girls                    .73        .32    2.26*         .77        .32    2.37*           .78        .32    2.42*
Grade                    -.04       .45    -.08          -.97       .58    -1.67           -.22       .58    -.38
Fall PA rates            .27        .14    1.87†         .26        .14    1.81            .35        .15    2.37*
Fall RA rates            .23        .06    3.87***       .23        .06    [missing]       .20        .06    3.20**
Personal beliefs         -.11       .30    -.49          -.22       .23    -.92            -.17       .23    -.72
Classroom beliefs                                        2.13       .92    2.31*           1.60       .87    1.85†
Classroom rates of PA                                                                      -1.50      .47    -3.10**
Classroom rates of RA                                                                      .44        .20    2.19*
Random class effects     .83        .42                  .60        .37                    .33        .29
-2 log likelihood        1,171.8                         1,166.9                           1,158.0
AIC                      1,187.8                         1,184.9                           1,180.0

Note. N = 254. PA = proactive aggression; RA = reactive aggression; AIC = Akaike information criterion.
† p < .10. * p < .05. ** p < .01. *** p < .001.
creates a highly arousing context that impulsive children, those
most  prone  to  reactive  aggression,  may  find  difficult  to  resist 
(Farrell  et  al.,  2010). 

The  role  of  personal  retaliation  beliefs  in  predicting  increased 
students’  proactive  aggression  is  consistent  with  investigations  of 
self-serving  rationales  that  are  employed  by  young  people  to 
justify  bullying.  Framing  actions  as  a  legitimate  response  to  victim 
behavior (e.g., “He deserved it after what he did to me”) is a
common  element  in  moral  disengagement  (Perren  et  al.,  2012), 
moral  justification  (de  Castro  et  al.,  2012),  and  selective  moral 
engagement  (Hawley  &  Geldhof,  2012).  The  convoluted  logic  that 
adolescents  employ  to  avoid  labeling  aggression  as  bullying  by 
citing  subsequent  retaliatory  responses  of  victims  (Boyd,  2014) 
also  supports  this  possibility.  Justifications  enable  aggressors  to 
bypass  moral  constraints  in  themselves  and  others.  Unlike  the 
self-regulatory  deficits  linked  to  reactive  aggression,  such  biased 
justifications  (Arsenio  &  Gold,  2006)  do  not  represent  a  failure  to 
evaluate  aggression  (Fontaine  &  Dodge,  2006).  Instead,  biases 
seem  attuned  to  social  goals  of  peer  acceptance  and  social  domi¬ 
nance. 

Surprisingly,  our  findings  show  that  students’  proactive  aggres¬ 
sion  was  only  related  to  concurrent  personal  beliefs  in  the  spring, 
while  students’  reactive  aggression  was  only  linked  to  personal 
beliefs  in  the  fall.  Perhaps  justifications  for  students’  proactive 
aggression  become  more  finely  honed  in  conversations  with 
friends  (Caravita,  Sijtsema,  Rambaran,  &  Gini,  2014),  resulting  in 
greater  belief-behavior  concordance  in  the  spring.  In  contrast, 
individual  differences  in  impulsivity  and  personal  beliefs  may 
contribute  less  to  reactive  aggression  as  the  year  progresses  than 
the  student’s  status  as  a  victim  or  nonvictim. 

Functional  Relationships  Between  Classroom  and 
Student  Rates  of  Aggression 

Students  in  classrooms  with  higher  rates  of  proactive  aggression 
in  the  fall  showed  spring  reductions  in  both  proactive  and  reactive 
aggression,  relative  to  those  in  classrooms  with  low  rates.  This 
might  be  interpreted  in  light  of  presumed  function.  High  classroom 
rates  of  proactive  aggression  may  be  indicative  of  a  group  of 
students  who  are  competing  to  establish  themselves  in  dominant 
positions  within  a  strongly  hierarchical  class  structure.  Behavior 
ecological  analyses  (Pellegrini,  2008)  suggest  that  when  domi¬ 
nance  hierarchies  stabilize,  aggression  diminishes.  Resignation 
and  fear  on  the  part  of  subordinate  classmates  (Camodeca  & 
Goossens,  2005;  Craig  et  al.,  2007)  may  reduce  resistance  to 
structural  inequities.  It  is  also  possible  that  manipulation  and 
subterfuge  on  the  part  of  proactive  aggressors  (Garandeau  & 
Cillessen,  2006;  Little  et  al.,  2003;  Xie  et  al.,  2002)  reduces  the 
likelihood  of  successful  retaliation  or  competition. 

In  contrast,  students  in  classrooms  with  higher  rates  of  reactive 
aggression  in  the  fall  exhibited  relatively  high  rates  of  proactive 
and  reactive  aggression  in  the  spring.  This  could  be  indicative  of 
classrooms  that  were  simply  more  contentious  and  argumentative 
than  average.  In  that  scenario,  reactive  aggression  would  beget 
more  reactive  aggression.  It  might  also  identify  classrooms  in 
which  students  continued  to  aggressively  resist  being  subjugated. 
While  moral  outrage  may  result  in  preemptive  processing  and 
ill-considered actions, retaliation against perceived injustice is
regarded as a duty in some cultures, even when resistance may be
rationally  considered  to  be  futile  (Frey  et  al.,  2015). 

These  considerations  speak  to  the  role  that  classroom  inequities 
and  injustice  can  play  in  provoking  reactive  aggression.  SIP  mod¬ 
els  clearly  provide  a  framework  for  considering  the  contributions 
of  moral  emotions  and  beliefs  during  response  evaluation.  Arsenio, 
Adams, and Gold (2009), for example, found that the expectation
that  unprovoked  aggression  will  lead  to  positive  emotions  is 
uniquely  linked  to  proactive  aggression.  Research  has  also  exam¬ 
ined  how  beliefs  that  one  has  been  treated  unfairly  can  be  used  to 
legitimize  aggression  (Perren  et  al.,  2012).  There  has  been  surpris¬ 
ingly  little  attention  to  how  such  beliefs  might  impact  in-the- 
moment emotional reactions that are particularly relevant for reactive
aggression. Continued efforts to integrate work on moral
emotions,  beliefs,  and  aggression  is  needed  (Arsenio  &  Lemerise, 
2004;  Malti  &  Latzko,  2010),  particularly  with  respect  to  process¬ 
ing  characteristics  identified  by  SIP  models. 

Implications  for  Intervention  Efforts 
and  Future  Research 

The  contribution  of  personal  beliefs  endorsing  retaliation  to  later 
students’  proactive  aggression  supports  the  importance  of  address¬ 
ing  normative  beliefs  during  intervention  efforts.  Over  time,  such 
efforts  may  also  indirectly  reduce  reactive  aggression  by  reducing 
endorsement  at  the  classroom  level.  One  strategy  uses  attitudinal 
surveys  coupled  with  personalized  feedback  aimed  at  reducing 
students’  tendency  to  overestimate  peer  support  for  aggression. 
More  research  is  needed  to  evaluate  such  approaches  (see  Henry, 
2008  for  a  discussion  of  the  promise  and  pitfalls). 

Future  research  is  also  needed  to  examine  cultural  influences 
regarding  retaliation.  In  “honor  cultures”  such  as  those  in  the 
southern  United  States,  retaliation  is  regarded  as  essential  for 
personal  safety  and  social  recognition  of  manhood.  In  contrast, 
“dignity  cultures”  such  as  those  in  New  England  proscribe  per¬ 
sonal  retaliation  as  an  indicator  of  poor  self-regulation  and  anti¬ 
social  behavior.  The  magnitude  of  variation  suggests  that  retalia¬ 
tion  norms  may  have  considerable  real-world  significance  for  the 
efficacy  of  intervention  practices  in  different  regions  or  with 
different  populations. 

While  acknowledging  potential  cultural  differences  on  this 
point,  it  appears  that  students  who  are  reactively  aggressive  suffer 
increasing  victimization  over  time  (Salmivalli  &  Helteenvuori,
2007).  Thus,  reactive  aggression  appears  inimical  to  student  wel¬ 
fare.  Victimization  depletes  self-regulatory  capacity  (Baumeister, 
DeWall,  Ciarocco,  &  Twenge,  2005)  in  a  population  that  may 
include  a  high  proportion  of  impulsive  youth.  Therefore,  interven¬ 
tion  programs  need  to  address  regulatory  skills  as  well  as  social 
norms  that  support  effective  nonaggressive  responses.  Skill  reme¬ 
diation  may  require  additional  time  to  achieve  reductions  in  reac¬ 
tive  aggression  (Frey  et  al.,  2009)  or  increases  in  teacher  support 
(Hirschstein,  Edstrom,  Frey,  Snell,  &  MacKenzie,  2007). 

Importantly,  interventions  need  to  address  issues  of  justice  in 
order  to  motivate  self-regulation  efforts.  In  line  with  this  thinking, 
cooperation  is  higher  (Pellegrini,  2008)  and  aggression  lower 
(Elgar,  Craig,  Boyce,  Morgan,  &  Vella-Zarb,  2009;  Garandeau, 
Lee,  &  Salmivalli,  2014)  when  resources  and  power  are  equitably 
distributed  than  in  a  winner-takes-all  context.  Further,  aggression
is  lower  in  classrooms  whose  teachers  attempt  to  mitigate  status 
differentials  between  students  (Serdiouk,  Rodkin,  Madill,  Logis,  &
Gest,  2015).  Both  degree  of  stratification  and  hierarchy  stability 
may  influence  aggression  rates.  Ecological  analyses  indicate  that 
hierarchy  stabilization  enables  social  dominants  to  decrease  ag¬ 
gression  while  still  maintaining  control  (Pellegrini,  2008).  The 
impact  of  hierarchy  stability  on  victims  is  unclear.  A  highly 
inequitable,  yet  stable  hierarchical  structure  may  provide  tempo¬ 
rary  relief  for  some,  while  others  remain  pariahs  (Adler  &  Adler, 
1995).  Students  may  suffer  the  realization  that  their  personal 
security  is  dependent  on  the  whims  of  others. 

A  functional  analysis  integrating  information  about  equality  and 
stability  suggests  specific  hypotheses  that  are  amenable  to  testing, 
given  sufficient  classrooms.  Students’  proactive  aggression  is 
likely  to  be  highest,  for  example,  if  destabilization  of  strongly 
hierarchical  systems  creates  power  struggles.  Extreme  status  inequities
raise  the  stakes  associated  with  being  a  beneficiary  or,  conversely,
someone  who  bears  the  cost  of  injustice.  Thus,  interventions
that  appear  to  be  succeeding  in  strongly  hierarchical
contexts  may  unleash  reactionary  forces  aimed  at  preserving  social 
power  (see  the  meta-analysis  of  Bettencourt,  Charlton,  Dorr,  & 
Hume,  2001).  A  possible  result  in  those  classrooms  may  be  a 
temporary  increase  or  delayed  reduction  in  conflict  and  aggression. 
These  scenarios  highlight  the  need  for  intervention  studies  that 
measure  outcomes  more  frequently  during  the  course  of  interven¬ 
tion. 

Limitations  and  Strengths 

Several  limitations  should  be  noted.  We  did  not  measure  actual 
peer  influence  or  perceptions  of  peers.  Although  we  relied  on 
functional  analyses  to  guide  our  hypotheses,  without  knowing 
individuals’  relative  influence  or  the  classroom  hierarchical  struc¬ 
ture,  our  conclusions  are  necessarily  speculative.  In  addition,  high- 
status  classmates  are  more  influential  than  a  random  subsample 
selected  for  observation  (Cohen  &  Prinstein,  2006).  Perceived 
popularity  may  be  especially  important  in  middle  school,  when 
students  interact  with  many  students  in  multiple  classrooms.  In  that 
case,  they  often  have  incomplete  information  about  fellow  stu¬ 
dents,  and  attending  to  the  behavior  exemplified  by  high-status 
individuals  (Paluck  &  Shepherd,  2012)  may  be  an  efficient  way  of 
inferring  behavioral  norms.  Thus,  our  results  may  apply  primarily 
to  students  who  remain  in  one  classroom  and  are  quite  familiar 
with  the  status  and  aggressive  characteristics  of  their  classmates. 

An  additional  caution  regarding  generalizability  is  warranted 
due  to  the  small  number  and  restricted  location  of  schools  in  this 
study.  We  cannot  speak  to  how  personal  beliefs  and  actions  com¬ 
mon  at  the  school-level  contribute  to  later  aggression.  Nor  can  we 
address  the  possibility  that  regions  that  ascribe  greater  importance 
to  retaliation  (Cohen,  2001)  might  yield  different  results. 

We  also  do  not  know  how  applicable  our  results  are  when  finer 
distinctions  are  made  within  these  specific  types  of  aggression. 
Bailey  and  Ostrov  (2008)  found  that  associations  between  young 
adults’  aggressive  beliefs  and  aggressive  behavior  varied  by  both 
aggressive  form  and  function.  Proactive  relational  aggression,  for 
example,  might  be  less  effective  than  proactive  physical  aggres¬ 
sion  at  inhibiting  retaliation.  It  is  also  possible  that  the  relationship 
between  aggressive  form  and  function  is  not  orthogonal,  such  that 
impulsive,  reactive  responding  is  more  strongly  correlated  with
overt  aggression  than  with  relational  aggression  (e.g.,  Frey  et  al.,
2014). 

These  limitations  are  juxtaposed  against  the  clarity  of  precise 
hourly  rates  of  aggression  provided  by  observers  who  were  blind 
to  hypotheses.  Research  has  been  hampered  by  measurement  prob¬ 
lems  such  as  the  very  high  correlations  between  teacher  ratings  of 
proactive  and  reactive  aggression  common  in  much  past  research. 
This  problem  has  led  to  the  conclusion  in  two  meta-analyses  (Card 
&  Little,  2006;  Polman  et  al.,  2007)  that  trained  observers  are 
better  able  to  distinguish  these  two  types  than  are  untrained  infor¬ 
mants.  Observer  expertise,  combined  with  theoretically  coherent 
links  between  student-  and  classroom-level  aggressive  beliefs  and 
behavior,  is  a  significant  strength  of  this  study.  Further  research 
into  the  functional  significance  of  such  beliefs  and  specific  types 
of  aggression  in  school  settings  will  contribute  to  more  effective 
intervention  strategies.  Particular  benefits  may  derive  from  prac¬ 
tices  that  respond  to  the  need  for  justice  that  can  motivate  reactive 
aggression,  and  those  that  provide  alternate  sources  of  satisfaction 
for  those  motivated  by  social  dominance. 

Empirical  studies  have  demonstrated  that  students  who  are  taught  in  a  group  of  students  with  higher
average  achievement  benefit  in  terms  of  their  achievement.  However,  there  is  also  evidence  showing  that 
being  surrounded  by  high-achieving  students  has  a  negative  effect  on  students’  academic  self- 
concept,  also  known  as  the  big-fish-little-pond  effect.  In  view  of  the  reciprocal  relationship  between 
achievement  and  academic  self-concept,  the  present  study  aims  to  scrutinize  how  the  average 
achievement  of  a  class  affects  students’  achievement  and  academic  self-concept,  and  how  that,  in  turn, 
affects  subsequent  achievement  and  academic  self-concept.  Using  a  sample  of  6,463  seventh-graders 
from  285  classes  in  Germany,  multilevel  path  models  showed  that  the  class-average  achievement  at  the 
beginning  of  the  school  year  positively  affected  individual  achievement  in  the  middle  and  at  the  end  of 
the  school  year,  and  negative  effects  on  academic  self-concept  occurred  only  at  the  beginning  of  Grade 
7,  but  not  later  in  the  school  year.  In  addition,  mediation  analyses  revealed  that  the  effects  of 
class-average  achievement  on  students’  achievement  and  academic  self-concept  at  the  end  of  the  school 
year  were  mediated  by  midterm  achievement,  but  not  by  midterm  academic  self-concept.  This  pattern 
was  found  for  mathematics,  biology,  physics,  and  English  as  a  foreign  language.  The  results  of  our  study 
indicate  that  the  consequences  for  students  of  belonging  to  a  group  of  high-achieving  students  should  be 
analyzed  with  respect  to  both  academic  self-concept  and  achievement. 

Keywords:  compositional  effects,  big-fish-little-pond  effect,  academic  self-concept,  reciprocal  effects 
model 


Whether  and  how  students  are  influenced  by  their  class-  and 
schoolmates  is  a  perpetual  topic  in  education.  There  is  a  popular
belief  that  it  is  better  for  students  to  be  surrounded  by  smart  and 
high-achieving  peers.  And  in  fact,  there  are  a  number  of  studies 
showing  that  the  average  achievement  level  of  a  class  or  school  has 
a  positive  effect  on  subsequent  individual  achievement  over  and
above  students’  prior  achievement  (Burns  &  Mason,  2002;
Hanushek,  Kain,  Markman,  &  Rivkin,  2003;  Marks,  2010;
Opdenakker,  van  Damme,  de  Fraine,  van  Landeghem,  &  Onghena,  2002).
However,  there  is  also  research,  in  particular  from  educational 
psychology,  showing  that  being  in  a  high-achieving  class  or  school 
can  have  detrimental  effects  on  other  outcomes,  such  as  educa¬ 
tional  aspirations,  academic  interests,  or  academic  self-concept 
(for  an  overview,  see  Marsh  et  al.,  2008).  Numerous  studies  on  the 
so-called  big-fish-little-pond  effect  (BFLPE;  Marsh,  1987)  have 
demonstrated  that  students  with  the  same  achievement  level  have 
a  lower  academic  self-concept  when  they  are  in  a  class  of  high- 
achieving  students  than  when  they  are  in  a  class  of  low-achieving 
students  (for  an  overview,  see  Marsh  et  al.,  2008).  Being  sur¬ 
rounded  by  high-achieving  peers  thus  has  different  effects  depend¬ 
ing  on  the  outcome:  It  has  a  positive  effect  on  students’  achieve¬ 
ment  but  a  negative  effect  on  their  academic  self-concept.  Given 
that  schools  should  foster  both  students’  achievement  and
their  motivational  development,  it  is  important  to  investigate  how 
the  average  achievement  of  a  group  affects  students’  academic 
development  both  in  terms  of  their  achievement  and  their  aca¬ 
demic  self-concept.  To  our  knowledge,  a  simultaneous  analysis  of 
these  two  opposite  effects  of  class-  or  school-average  achievement
has  not  been  conducted.  This  is  all  the  more  surprising  considering 
that  academic  self-concept  and  achievement  are  reciprocally  asso¬ 
ciated  with  each  other  on  the  individual  level.  That  is,  having  a 
high  academic  self-concept  has  positive  effects  on  subsequent 
achievement  and  vice  versa  (Marsh  &  Martin,  2011). 

Therefore,  the  goal  of  the  present  study  is  to  scrutinize  the  interplay 
between  the  effect  of  class-average  achievement  on  achievement  and 
the  effect  of  class-average  achievement  on  academic  self-concept. 
More  specifically,  using  a  longitudinal  dataset  with  three  measure¬ 
ment  points,  the  study  analyzes  how  the  average  achievement  of  a 
class  affects  students’  achievement  and  academic  self-concept,  and 
how  that,  in  turn,  affects  subsequent  achievement  and  academic 
self-concept.  Metaphorically  speaking,  taking  into  account  both 
achievement  development  and  academic  self-concept,  is  a  fish  better 
off  swimming  in  a  “big  pond”  or  in  a  “little  pond”? 

Before  describing  our  own  study  in  more  depth,  we  describe 
empirical  findings  from  different  research  strands  that  are  brought 
together  in  our  research  question:  First,  we  discuss  previous  research 
on  the  positive  effects  of  average  achievement  on  students’  individual 
achievement,  which  is  referred  to  in  the  literature  using  the  term 
compositional  effects.  Second,  we  turn  to  the  negative  effects  of 
average  achievement  on  students’  academic  self-concept,  known  as 
the  BFLPE.  Third,  we  lay  out  the  reciprocal  relationship  between 
achievement  and  academic  self-concept  on  the  individual  level,  which 
has  also  been  called  the  reciprocal  effects  model.  Fourth  and  finally, 
we  present  findings  from  the  few  empirical  studies  that  have,  like 
ours,  looked  at  both  the  effects  of  average  achievement  on  students’ 
individual  achievement  and  academic  self-concept. 

Positive  Effects  of  Average  Achievement  on  Students’ 
Individual  Achievement 

Since  at  least  the  publication  of  the  Coleman  report  (Coleman  et 
al.,  1966),  educational  researchers  have  been  investigating  whether 
the  composition  of  a  class  or  school  influences  the  individual 
development  of  students.  Just  as  each  child  is  unique,  there  are  a 
lot  of  differences  between  the  student  bodies  of  classes  or  schools. 
In  most  cases,  students  are  not  randomly  assigned  to  a  classroom 
or  a  school,  which  leads  to  systematic  differences  between  them. 
On  the  one  hand,  the  composition  of  a  group  of  students  is
determined  by  the  population  of  the  district  or  neighborhood  in
which  the  school  is  located  (i.e.,  implicit  tracking;  Hallinan,  1994;
Trautwein,  Köller,  Lüdtke,  &  Baumert,  2005).  Moreover,  in  many
educational  systems,  students  are  allocated  to  different  school
tracks  or  ability  groups  according  to  achievement  level  (i.e.,
explicit  tracking  or  ability  grouping;  Ireson  &  Hallam,  2001;  Oakes,
1985;  Trautwein  et  al.,  2005).  Therefore,  schools  or  classes  can  be 
characterized  by  a  certain  composition  with  respect  to  student 
characteristics  such  as  achievement,  socioeconomic  status  (SES), 
or  ethnicity  (see  van  Ewijk  &  Sleegers,  2010).  A  compositional 
effect  is  thus  the  effect  of  the  school  or  classroom  composition  on 
students’  individual  achievement  over  and  above  the  effects  of 
student  characteristics  at  the  individual  level  (Harker  &  Tymms, 
2004).  Figure  1  shows  the  path  model  of  the  effect  of  school/class- 
average  achievement  on  individual  achievement  after  controlling 
for  prior  individual  achievement,  which  is  the  focus  of  the  present 
article  (for  readability,  we  also  refer  to  this  effect  as  the  “compo¬ 
sitional  effect  on  achievement”). 
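
To make this definition concrete, the following minimal sketch regresses later achievement on prior individual achievement and on the class mean of prior achievement. It is an illustration only: the function names (`class_means`, `solve`, `compositional_effect`) are hypothetical, the estimator is ordinary least squares rather than the multilevel path models used in the study, and the clustering of students within classes is ignored. The coefficient on the class mean plays the role of the compositional effect.

```python
from collections import defaultdict


def class_means(class_ids, scores):
    """Return, for each student, the mean score of that student's class."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cid, score in zip(class_ids, scores):
        totals[cid] += score
        counts[cid] += 1
    return [totals[cid] / counts[cid] for cid in class_ids]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def compositional_effect(class_ids, prior, later):
    """OLS of later achievement on an intercept, prior achievement, and
    class-mean prior achievement. Returns [intercept, b_prior, b_class_mean]."""
    mean_prior = class_means(class_ids, prior)
    rows = [[1.0, p, m] for p, m in zip(prior, mean_prior)]
    k = 3
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, later)) for i in range(k)]
    return solve(xtx, xty)
```

A positive third coefficient would indicate that, at equal prior achievement, students in classes with higher average achievement score higher later on. In real data, a multilevel model would be the more appropriate choice.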

Indeed,  there  is  empirical  evidence  that  belonging  to  a  group  of 
high  achievers  has  a  positive  effect  on  the  development  of  stu¬ 
dents’  individual  achievement,  whereas  being  surrounded  by  low- 
achieving  peers  usually  has  a  negative  effect  (e.g.,  Hanushek  et  al., 
2003;  Marks,  2010).  One  explanation  for  such  effects  concerns  the 
reciprocal  influence  of  students  on  one  another  in  areas  that  are 
associated  with  achievement,  such  as  motivation  or  learning  effort 
(Harker  &  Tymms,  2004).  Another  mechanism  by  which  compositional
effects  can  be  explained  is  teachers  adapting  their  instruction
to  the  composition  of  a  group  of  students,  for  example,  by
providing  cognitively  more  demanding  instruction  for  high-achieving
groups  or  by  covering  more  subject  matter  in  the  same
amount  of  time  than  they  would  in  a  lower-achieving  class  (Dreeben
&  Barr,  1988;  Harker  &  Tymms,  2004;  Harris,  2010).  Additionally,
teachers  may  also  have  higher  expectations  for  students  in
high-achieving  classes,  which  can  have  a  positive  effect  on  stu¬ 
dents’  actual  performance  (Harker  &  Tymms,  2004;  Jussim  & 
Harber,  2005).  In  general,  compositional  effects  have  been  found 
to  be  larger  when  looking  at  the  composition  of  a  class  rather  than 
that  of  a  school,  as  the  class  is  the  immediate  learning  environment 
to  which  students  belong  (van  Ewijk  &  Sleegers,  2010). 


Figure  1.  Theoretical  path  model  of  the  compositional  effect  of  class-average  achievement  on  individual
achievement.  T1  =  Time  1;  T2  =  Time  2.  Plus  signs  indicate  a  positive  effect;  minus  signs  indicate  a  negative
effect.

Taken  together,  students  in  high-achieving  classes  may  benefit
from  the  composition  of  their  classes,  whereas  students  in  lower- 
performing  classes  are  more  likely  to  experience  disadvantages 
with  respect  to  how  much  they  learn.  For  educational  systems  with 
rigid  tracking  practices,  the  compositional  effect  of  average 
achievement  may  result  in  a  systematic  disadvantage  for  low- 
achieving  students,  who  are  typically  allocated  to  a  group  with 
other  low-achieving  students. 

Negative  Effects  of  Average  Achievement  on  Students’ 
Academic  Self-Concept 

Whereas  students  in  high-achieving  classes  benefit  with  respect 
to  their  achievement  development,  there  is  a  large  amount  of 
research  showing  that  belonging  to  a  high-achieving  group  of 
students  has  negative  consequences  for  students’  academic  self- 
concept  (Janssen,  Wouters,  Huygh,  Denies,  &  Verschueren,  2015; 
Jonkmann,  Becker,  Marsh,  Lüdtke,  &  Trautwein,  2012;  Liou,
2014;  Marsh,  2005;  Marsh  et  al.,  2014;  Marsh  &  O’Mara,  2010;
Marsh,  Trautwein,  Lüdtke,  Baumert,  &  Köller,  2007;  Nagengast  &
Marsh,  2011;  Roy,  Guay,  &  Valois,  2015;  Seaton,  Marsh,  & 
Craven,  2010;  Wang,  2015;  Wouters,  de  Fraine,  Colpin,  van 
Damme,  &  Verschueren,  2012).  This  phenomenon  is  explained  by 
the  fact  that  people  compare  themselves  to  others  in  their  reference 
group  (social  comparison  theory;  Festinger,  1954).  For  students, 
their  classrooms  and  their  schools  are  their  immediate  reference 
groups,  which  they  use  for  comparisons  to  form  perceptions  about 
their  own  competencies  (Marsh,  1984).  Being  surrounded  by  high- 
achieving  students  provides  more  opportunities  for  “upward”  com¬ 
parisons,  which  weaken  students’  academic  self-concept.  In  con¬ 
trast,  belonging  to  a  low-achieving  group  of  students  will  result  in 
a  more  positive  estimate  of  a  student’s  competencies  due  to  more 
“downward”  comparisons.  This  effect  is  known  as  the  BFLPE, 
defined  as  the  negative  effect  of  average  achievement  on  students’ 
academic  self-concept  while  controlling  for  individual  achieve¬ 
ment  (Marsh,  1987).  Figure  2  shows  the  corresponding  path 
model.  As  the  BFLPE  describes  an  effect  of  average  achievement 
on  an  individual  outcome  over  and  above  individual  achievement, 
it  can  also  be  considered  a  compositional  effect.  However,  in  line 
with  previous  research,  we  refer  to  it  as  BFLPE  or  describe  it  as 
“the  effect  of  class-average  achievement  on  academic  self- 
concept.” 
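
As a rough formalization, the BFLPE can be written as a two-level regression of academic self-concept on individual and class-average achievement. The notation below is an illustrative assumption and is not taken from the cited studies: ASC is the academic self-concept of student i in class j, ACH is that student's achievement, and the bar denotes the class mean.

```latex
\begin{align}
  \mathrm{ASC}_{ij} &= \beta_0 + \beta_1\,\mathrm{ACH}_{ij}
      + \beta_2\,\overline{\mathrm{ACH}}_{\cdot j} + u_j + e_{ij}, \\
  \beta_1 &> 0 \quad \text{(individual achievement raises self-concept)}, \\
  \beta_2 &< 0 \quad \text{(BFLPE: class-average achievement lowers self-concept)}.
\end{align}
```

Here u_j is a class-level residual and e_ij a student-level residual. The signs mirror the plus and minus signs described for the path model in Figure 2.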


A  remarkable  number  of  studies  have  focused  on  the  BFLPE. 
Just  in  the  past  decade,  more  than  70  articles  have  been  published 
in  leading  American  Psychological  Association  journals  on  the 
topic.  In  fact,  the  BFLPE  is  probably  one  of  the  best  researched 
phenomena  in  educational  psychology.  Studies  on  the  BFLPE 
have,  for  instance,  demonstrated  its  generalizability  across  differ¬ 
ent  cultures  (Marsh,  Kong,  &  Hau,  2000;  Marsh  et  al.,  2012; 
Nagengast  &  Marsh,  2011;  Seaton,  Marsh,  &  Craven,  2009;  Wang, 
2015)  and  its  replicability  in  different  grade  levels  (for  a  review, 
see  Seaton  &  Craven,  2011;  see  also  Marsh  et  al.,  2008)  and  in 
different  subjects  (Janssen  et  al.,  2015;  Liou,  2014).  Moreover, 
studies  have  shown  that  the  BFLPE  can  influence  academic  self- 
concept  and  other  academic  outcomes,  such  as  achievement,  even 
after  graduation  from  high  school  (e.g.,  Marsh  &  O’Mara,  2010; 
Marsh  et  al.,  2007). 

Typically,  the  BFLPE  is  modeled  in  cross-sectional  studies,  as 
Figure  2  shows  (for  a  review,  see  Marsh  et  al.,  2008).  In  studies 
analyzing  the  BFLPE  in  a  longitudinal  framework  (e.g.,  Marsh, 
Köller,  &  Baumert,  2001;  Marsh  et  al.,  2007),  the  effect  decreases
when  controlling  for  students’  previous  academic  self-concept.
Whereas  in  some  of  the  studies,  the  decreased  BFLPE  remains
relatively  high  (Köller  &  Baumert,  2001;  Marsh  &  O’Mara,  2010),
in  other  studies,  the  remaining  effect  is  small  (Marsh  et  al.,  2000;
Köller,  Trautwein,  Lüdtke,  &  Baumert,  2006),  and  in  some  studies,
the  effect  is  no  longer  significant  (Marsh  et  al.,  2001;  Lüdtke  &
Köller,  2002).

In  sum,  we  can  state  that  in  high-achieving  learning  environ¬ 
ments,  students’  academic  self-concept  declines,  whereas  low- 
achieving  classes  or  schools  can  protect  students’  academic  self- 
concept  because  they  offer  fewer  opportunities  for  upward 
comparisons.  This  pattern  is  the  reverse  of  that  for  the  composi¬ 
tional  effect  on  achievement,  in  which  students  benefit  in  the 
development  of  their  individual  achievement  in  high-achieving 
classes  or  schools. 

The  Reciprocal  Relationship  Between  Achievement 
and  Academic  Self-Concept 

When  investigating  the  effects  of  average  achievement  on  stu¬ 
dents’  achievement  and  their  academic  self-concept,  it  is  important 
to  bear  in  mind  that  achievement  and  academic  self-concept  are 
reciprocally  related  at  the  individual  level.  High  achievement  leads
to  a  higher  academic  self-concept  (known  as  the  skill-development
model;  first  mentioned  by  Calsyn  &  Kenny,  1977;  empirically
tested  by,  e.g.,  Skaalvik  &  Hagtvet,  1990),  and  a  high  academic
self-concept  leads  to  higher  achievement  (known  as  the
self-enhancement  model;  first  mentioned  by  Calsyn  &  Kenny,  1977;
meta-analysis  by  Valentine,  DuBois,  &  Cooper,  2004).  The  mutual
relationship  between  achievement  and  academic  self-concept  at
the  individual  level,  known  as  the  reciprocal  effects  model  (REM;
for  an  early  overview,  see  Byrne,  1984;  for  a  current  review,  see
Marsh  &  Martin,  2011),  is  illustrated  in  Figure  3.

Figure  2.  Theoretical  path  model  of  the  big-fish-little-pond  effect.  T1  =  Time  1;  T2  =  Time  2.  Plus  signs
indicate  a  positive  effect;  minus  signs  indicate  a  negative  effect.

Empirical  evidence  of  the  REM  has  been  presented  for  different 
grade  levels  (Helmke  &  van  Aken,  1995;  Pinxten,  Marsh,  de 
Fraine,  Van  den  Noortgate,  &  van  Damme,  2014;  Seaton,  Parker, 
Marsh,  Craven,  &  Yeung,  2014),  as  well  as  for  different  subjects
(mathematics,  e.g.,  Pinxten  et  al.,  2014;  first  language,  e.g.,
Retelsdorf,  Köller,  &  Möller,  2014).  Researchers  have  found  the
cross-lagged  paths  of  the  REM,  not  only  between  two  measure¬ 
ment  points,  but  also  for  three  and  more  measurement  points 
(Marsh  &  O’Mara,  2008;  Seaton  et  al.,  2014).  With  respect  to  the 
question  of  which  of  the  paths  is  stronger,  the  findings  have  been 
inconsistent.  Some  studies  have  found  achievement  to  have  stron¬ 
ger  effects  on  academic  self-concept  (Helmke  &  van  Aken,  1995; 
van  Damme,  Opdenakker,  de  Fraine,  &  Mertens,  2004),  whereas 
others  have  shown  the  opposite  (Marsh,  Hau,  &  Kong,  2002; 
Marsh,  Trautwein,  Lüdtke,  Köller,  &  Baumert,  2005).  Taken  together,
when  investigating  the  interplay  between  the  compositional
effect  on  achievement  and  the  BFLPE  over  time,  one  must  take 
into  account  the  paths  between  achievement  and  academic  self- 
concept  at  the  individual  level. 
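
The cross-lagged structure of the REM between two measurement points can be sketched as a pair of regression equations. The symbols are assumptions introduced here for illustration (ACH = achievement, ASC = academic self-concept, i = student) and are not the notation of the cited work.

```latex
\begin{align}
  \mathrm{ACH}_{i,T2} &= a_0 + a_1\,\mathrm{ACH}_{i,T1}
      + a_2\,\mathrm{ASC}_{i,T1} + \varepsilon_{i}, \\
  \mathrm{ASC}_{i,T2} &= b_0 + b_1\,\mathrm{ASC}_{i,T1}
      + b_2\,\mathrm{ACH}_{i,T1} + \delta_{i}.
\end{align}
```

In this sketch, a_1 and b_1 are the stability paths, a_2 > 0 corresponds to the self-enhancement path, and b_2 > 0 corresponds to the skill-development path. The question of which path is stronger amounts to comparing standardized a_2 and b_2.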

The  Relationship  Between  the  Effects  of  Average 
Achievement  on  Achievement  and  on  Academic 
Self-Concept 

Looking  at  both  research  on  compositional  effects  on  achieve¬ 
ment  and  research  on  the  BFLPE,  one  can  conclude  that  the 
composition  of  a  student’s  class  or  school  in  terms  of  its  average 
achievement  is  an  important  factor  when  understanding  a  student’s 
academic  development.  However,  depending  on  the  outcome,  one 
may  come  to  opposite  conclusions  with  respect  to  the  question  of 
whether  it  is  beneficial  to  be  in  a  high-achieving  class  or  school.  In 
order  to  develop  a  complete  picture  of  the  effects  of  the  average 
achievement  of  a  class  or  school  on  the  academic  development  of 
students,  a  simultaneous  investigation  of  both  effects  over  time  is 
needed  and,  in  fact,  has  been  proposed  (Chiu,  2012;  Dai  &  Rinn, 
2008;  Wouters  et  al.,  2012). 

Even  though  a  number  of  studies  have  modeled  average 
achievement  effects  on  individual  achievement  and  academic  self-concept
(e.g.,  Köller  &  Baumert,  2001¹;  Köller  et  al.,  2006²),  they
have  not  explicitly  addressed  the  relative  importance  of  the  compositional
effect  on  achievement  versus  the  BFLPE.  For  instance,
when  they  investigated  the  effects  of  achievement  grouping  from
Grade  7  to  10,  Köller  and  Baumert  (2001)  found  a  significant
BFLPE  but  no  compositional  effect  on  achievement  when  controlling
for  school  track.  In  another  study  by  Köller  et  al.  (2006)  that
focused  on  the  interaction  between  achievement,  academic  self- 
concept,  and  interest  between  Grades  10  and  12  in  the  academic 
track,  the  authors  found  significant  effects  of  school-average 
achievement  on  individual  achievement  and  on  academic  self-concept.
However,  both  of  these  studies  investigated  the  effects  of
school-average  achievement  on  individual  achievement  and  aca¬ 
demic  self-concept  in  separate  models.  In  a  study  looking  at 
students’  achievement  and  academic  self-concept  simultaneously, 
Marsh  and  O’Mara  (2010)  found  a  negative  effect  of  school- 
average  ability  on  academic  self-concept  but  no  longitudinal  ef¬ 
fects  on  students’  grades  or  their  level  of  education  (which  are 
indicators  of  students’  achievement).  Hence,  all  of  these  studies 
have  specific  limitations  with  regard  to  analyzing  the  relative 
importance  of  the  compositional  effect  on  achievement  and  the
BFLPE.  There  are  further  limitations  related  to  their  unit  of  analy¬ 
sis:  These  studies  focus  on  school  composition,  but  as  the  class  is 
the  most  immediate  reference  group  for  students  (Marsh,  Kuyper, 
Morin,  Parker,  &  Seaton,  2014)  this  may  be  the  more  relevant 
context  to  investigate. 

There  seems  to  be  only  one  study  that  addressed  both  students’ 
individual  achievement  and  their  academic  self-concept  simulta¬ 
neously  when  analyzing  the  effects  of  average  ability  at  the  class- 
level  (Rindermann  &  Heller,  2005).  The  authors  found  that  the 
positive  effects  on  achievement  were  larger  than  the  negative 
effects  on  academic  self-concept.  However,  they  only  looked  at 
two  time  points,  the  sample  was  not  representative  as  it  comprised 
a  small  number  of  classes  and  focused  on  special  classes  for  gifted 
students,  and  it  did  not  focus  on  domain-specific  achievement  or 
academic  self-concept,  but  on  general  ability. 

The  Present  Study 

To  analyze  how  the  class  as  a  learning  environment  affects 
students’  individual  academic  development,  it  is  important  to 
investigate  the  positive  effect  of  class-average  achievement  on 
individual  achievement,  that  is,  the  compositional  effect  on 
achievement,  and  the  negative  effect  of  class-average  achievement 
on  academic  self-concept,  that  is,  the  BFLPE,  simultaneously. 
Therefore,  the  aim  of  the  present  study  is  to  investigate  the 
interplay  between  the  compositional  effect  on  achievement  and  the 
BFLPE  over  time  using  a  longitudinal  design  with  three  measurement
points  (Time  1  [T1]:  the  beginning  of  Grade  7;  Time  2  [T2]:
the  middle  of  Grade  7;  Time  3  [T3]:  the  end  of  Grade  7).  More 
specifically,  our  research  question  reads  as  follows:  How  does  the 
average  achievement  of  a  class  at  the  beginning  of  Grade  7  affect 
students’  achievement  and  academic  self-concept  in  the  middle  of 
Grade  7,  and  how  does  that,  in  turn,  affect  achievement  and 
academic  self-concept  at  the  end  of  Grade  7?  We  approach  this 
research  question  by  first  analyzing  the  compositional  effect  on 
achievement,  the  BFLPE,  and  the  reciprocal  relation  between 
individual  achievement  and  academic  self-concept  separately  over 
the  course  of  the  school  year.  We  then  bring  these  different  effects 
together  in  one  model  to  address  our  research  question  (see  Figure 
4  for  the  corresponding  path  model). 

¹  Drawing  on  data  from  the  BIJU  study,  as  we  do,  Köller  and  Baumert
(2001)  analyzed  the  consequences  of  attending  different  schools  for
students’  academic  self-concept  and  achievement  in  mathematics  from  Grades
7  to  10.

²  Drawing  on  data  from  the  BIJU  study,  as  we  do,  Köller  et  al.  (2006)
analyzed  the  interplay  between  individual  achievement,  academic
self-concept,  and  interest  in  Grades  10  to  12.

Figure  3.  Theoretical  path  model  of  the  reciprocal  effects  model.  T1  =  Time  1;  T2  =  Time  2.  Plus  signs
indicate  a  positive  effect;  minus  signs  indicate  a  negative  effect.

Because  there  is  evidence  about  the  compositional  effect  on
achievement,  the  BFLPE,  and  the  reciprocal  effects  model  in  a
number  of  different  domains/subjects  (for  an  overview  regarding
the  compositional  effect  on  achievement  in  different
domains,  see  Dumont  et  al.,  2013;  for  an  overview  regarding  the
BFLPE  in  different  domains,  see  Marsh  et  al.,  2008;  REM  in 
mathematics,  e.g.,  Pinxten  et  al.,  2014;  REM  in  first  language, 
e.g.,  Retelsdorf  et  al.,  2014),  the  effects  can  be  assumed  to  be 
domain-unspecific  effects.  In  our  study,  we  focus  on  the  aca¬ 
demic  development  of  students  in  the  domain  of  mathematics  as 
an  illustrative  domain  because  this  area  is  the  focus  of  a 
particularly  large  body  of  studies  on  the  compositional  effect  on 
achievement  and  the  BFLPE  (e.g.,  Harker  &  Tymms,  2004; 
Marsh  et  al.,  2007,  2014)  and  allows  us  to  put  our  findings  into 
context  with  previous  findings.  Mathematics  also  offers  an 
advantage  for  analyzing  social  comparative  processes  and  dif¬ 
ferences  in  competencies  because,  it  has  been  argued,  it  is 
especially  bound  by  the  curriculum  and  it  is  more  clearly 
definable  and  distinct  in  terms  of  its  curricular  content  than 
other  subjects  (Gniewosz,  2010).  In  contrast,  reading  compe¬ 
tencies  are  relevant  in  all  subjects.  To  further  investigate 
whether  the  effects  indeed  represent  domain-unspecific  mech¬ 


anisms  and  do  not  emerge  solely  for  the  selected  domain  of 
mathematics,  we  replicate  the  analyses  for  three  other  subjects, 
namely  biology,  physics,  and  English  as  a  foreign  language. 

Method 

Sample 

The  present  study  draws  on  data  from  the  longitudinal  mul¬ 
ticohort  study  BIJU  (Learning  Processes,  Educational  Careers, 
and  Psychosocial  Development  in  Youth  and  Adolescence), 
which  was  conducted  by  the  Max  Planck  Institute  for  Human 
Development  in  Berlin  and  begun  in  the  school  year  1991/92 
with  a  cohort  of  seventh-grade  students  (for  more  details,  see 
Baumert  et  al.,  1996).  Although  the  study  was  conducted  in  four 
federal  states  in  Germany  (Berlin,  Saxony-Anhalt  [SA], 
Mecklenburg-Western  Pomerania  [MWP],  North  Rhine- 
Westphalia  [NRW]),  we  only  use  data  from  three  states,  as  the 
survey  started  at  a  later  measurement  point  in  Berlin.  We  study 
data  from  the  first  three  measurement  points  of  the  BIJU  study: 


Figure  4.  Path  model  of  the  present  study.  For  readability,  only  the  paths  relevant  to  our  research  question  are 
shown  here.  T1  =  Time  1;  T2  =  Time  2. 
the beginning of Grade 7 (T1), halfway through Grade 7 (T2),
and the end of Grade 7 (T3). At each measurement point,
trained research assistants administered achievement tests on
different domain-specific skills and cognitive skills, as well as
student  questionnaires,  on  two  consecutive  school  days.  The 
dataset  is  well  suited  for  our  study,  as  it  assesses  complete 
classes  and  comprises  three  measurement  points  for  achieve¬ 
ment  and  academic  self-concept  during  a  timeframe  in  which 
the  classroom  context  does  not  change. 

After  removing  classes  with  fewer  than  10  students  to  assure 
a  reliable  estimate  of  class-average  achievement,  the  final  sam¬ 
ple  consisted  of  6,463  students  from  285  classes  at  151  schools 
(53.2% girls; average age 12.7 years [SD = 0.66]), including
46.8%  students  from  the  academic  track  (59.2%  girls;  average 
age  12.6  [SD  =  0.53])  and  53.2%  students  from  different 
nonacademic  tracks  (47.8%  girls;  average  age  12.8  [SD  = 
0.74]).  In  terms  of  students’  social  background,  3.6%  of  stu¬ 
dents  spoke  a  language  other  than  German  at  home,  and  47.2% 
had one or more parents with a college degree. Of the students, 52.9%
attended a school in NRW, 23.3% in MWP, and 23.8% in SA. The three
federal states differ in demographic characteristics. The former
West-German state of NRW is one of the largest of all German states and
had a very high population density in 1991 (514 people per km²), in
contrast to the two former East-German states (MWP: 80 people per km²;
SA: 138 people per km²; Statistisches Bundesamt, 2010). Moreover, the
unemployment rate in the latter two states was higher than in NRW (MWP:
6.6%; SA: 8.3%; NRW: 3.9%; Statistisches Bundesamt, 2010), and the gross
domestic product per citizen was lower (US $11,208) than in the average
West-German state at the time (US $25,877; Bundesministerium für
Wirtschaft und Energie, 2015).

Instruments 

Mathematics achievement. Students’ mathematical competencies were
measured using a curriculum-validated standardized achievement test.
Each test comprised approximately 30 items taken from the Second
International Mathematics Study of the International Association for
the Evaluation of Educational Achievement (SIMS; Travers & Westbury,
1989) and from a study by the Max Planck Institute for Human Development
(Baumert, Roeder, Sang, & Schmitz, 1986). The test covered
content  areas  including  geometry,  algebra,  and  arithmetic.  The 
anchor  test  design  (Hambleton  &  Swaminathan,  1989)  enabled 
the  estimation  of  individual  achievement  test  scores  on  a  joint 
metric  for  all  three  assessment  rounds  via  weighted  likelihood 
estimation  IRT-modeling  (Warm,  1989).  More  details  regarding 
the content and the scaling of the test can be found in Köller
(1998) and Köller, Baumert, and Schnabel (1999). The internal
consistency of the test was very high at all three measurement
points, that is, Cronbach’s alpha was greater than .80 at each
point. Intraclass correlation coefficients (ICCs) of .41 (T1), .55
(T2),  and  .57  (T3)  represent  a  substantial  amount  of  variance  in 
mathematics  achievement  between  classes.  This  can  be  seen  as 
an  indicator  for  the  grouping  of  students  into  different  tracks  by 
achievement,  a  key  feature  of  the  German  secondary  school 
system.  The  high  ICCs  are  an  important  precondition  for  our 


study  as  they  imply  different  learning  environments  concerning 
class-average  mathematics  achievement. 
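
To make the role of the ICC concrete, the following minimal sketch estimates the share of variance that lies between classes from one-way ANOVA mean squares. It is an illustration only: the ICCs reported above come from the study’s own analyses, and the estimator, the function name `icc1`, the toy scores, and the class labels are assumptions made for this example.

```python
from collections import defaultdict
from statistics import fmean


def icc1(scores, class_ids):
    """Estimate ICC(1) from one-way ANOVA mean squares (illustrative estimator)."""
    by_class = defaultdict(list)
    for score, class_id in zip(scores, class_ids):
        by_class[class_id].append(score)

    n_total = len(scores)
    n_classes = len(by_class)
    grand_mean = fmean(scores)

    ss_between = 0.0
    ss_within = 0.0
    for members in by_class.values():
        class_mean = fmean(members)
        ss_between += len(members) * (class_mean - grand_mean) ** 2
        ss_within += sum((x - class_mean) ** 2 for x in members)

    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_total - n_classes)
    # Effective class size; equals the common class size when classes are balanced.
    n0 = (n_total - sum(len(m) ** 2 for m in by_class.values()) / n_total) / (n_classes - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Hypothetical toy data: two classes of two students each.
toy_scores = [0, 2, 4, 6]
toy_classes = ["a", "a", "b", "b"]
toy_icc = icc1(toy_scores, toy_classes)  # 14/18, about 0.78
```

A value close to 1 means that most of the variance in scores lies between classes, whereas a value close to 0 means that classes barely differ in their average score.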

Mathematics self-concept. Students’ academic self-concept in
mathematics was assessed using five items (see Appendix A) based on
Jopt (1978) and Jerusalem (1984). All items required students to use a
four-point Likert scale to indicate their agreement (1 = strongly agree
to 4 = strongly disagree). Cronbach’s alphas were .83 (T1), .89 (T2),
and .90 (T3), showing that the scales were reliable at all three
measurement points. This is in line with previous research showing high
reliabilities of this scale, indicated by Cronbach’s alphas larger than
.80 (for the mathematics self-concept scales, see also Köller, Daniels,
Schnabel, & Baumert, 2000; Möller & Köller, 1995). Moreover, previous
research has shown a high construct validity of this scale, indicated
by high correlations of the mathematics self-concept scales with grades
and academic achievement (Baumert, Schnabel, & Lehrke, 1998; Möller &
Köller, 1995). The ICCs were .07 (T1), .06 (T2), and .08 (T3). These
small proportions of class-level variance indicate that students rate
their own competencies in comparison to their classmates (see
theoretical background).
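
For readers unfamiliar with the reliability coefficient reported above, the following minimal sketch shows how Cronbach’s alpha is computed from item-level responses. The function name `cronbach_alpha` and the toy responses are hypothetical, and the sketch assumes complete data with all items coded in the same direction; it does not reproduce the study’s own computation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(n_items)]
    # Sample variance of the respondents' sum scores.
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of four students to five items on a 1-4 scale.
toy_responses = [
    [1, 1, 2, 1, 1],
    [2, 2, 2, 3, 2],
    [3, 3, 4, 3, 3],
    [4, 4, 4, 4, 3],
]
toy_alpha = cronbach_alpha(toy_responses)
```

Alpha approaches 1 when the items covary strongly relative to their individual variances, which is the sense in which values above .80 are read as reliable here.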

Control  variables.  Control  variables  included  gender  (with 
girls  as  the  reference  group),  parents’  socioeconomic  status 
(measured  via  the  highest  Treiman  Index  of  the  family;  the 
Treiman  Index  ranges  from  0  to  100,  with  a  higher  score 
indicating  a  higher  status),  parents’  educational  background 
(measured  via  a  dummy  variable  indicating  whether  at  least  one 
parent  had  a  college  degree),  language  spoken  at  home  (mea¬ 
sured  via  a  dummy  variable  indicating  if  students  did  not  speak 
German  at  home),  school  track3  (as  a  dummy  variable  indicat¬ 
ing  whether  students  were  in  the  academic  track),  and  federal 
state  (as  two  dummy-coded  variables  with  NRW  as  the  refer¬ 
ence  category). 

The  selection  of  the  control  variables  is  based  on  empirical 
research  on  disparities  in  achievement  and  academic  self- 
concept.  There  is  evidence  for  gender  differences  in  both 
achievement  and  academic  self-concept,  depending  on  the  sub¬ 
ject.  In  the  case  of  mathematics,  studies  revealed  higher 
achievement  (Organisation  for  Economic  Co-operation  and  De¬ 
velopment,  2010)  and  academic  self-concept  for  boys  (Marsh  & 
Yeung,  1998;  Skaalvik  &  Skaalvik,  2004).  Family  background 
also  predicts  academic  achievement  (Organisation  for  Eco¬ 
nomic  Co-operation  and  Development,  2010)  and  academic 
self-concept  (Craven  &  Marsh,  2005).  Furthermore,  we  control 
for  school  track  because,  in  the  German  secondary  school 
system,  class-average  achievement  is  partly  confounded  by 
school  track.  As  school  tracks  differ  not  only  with  respect  to 
students’  average  achievement  but  also  in  other  aspects  such  as 
teacher  quality  and  curriculum  (Baumert,  Stanat,  &  Watermann, 


3 The German secondary school system is characterized by an allocation
of  students  into  different  school  tracks  after  elementary  school  according  to 
their  prior  achievement.  Even  though  there  are  small  differences  between 
the  federal  states  concerning  type  and  the  number  of  different  school 
tracks,  the  main  differentiation  in  all  states  is  between  Gymnasium  as  the 
academic  track  leading  to  the  Abitur  (the  prerequisite  for  university  en¬ 
trance)  and  the  remaining  non-academic  tracks.  This  is  why  we  created  a 
dummy  variable  distinguishing  between  academic  and  non-academic  track. 
2006), it is necessary to control for school track in the specific
German  context.  We  further  control  for  potential  differences 
between  the  federal  states.  Intercorrelations  of  all  variables 
considered  in  the  analyses  are  presented  in  Appendix  B. 

Statistical  Analyses 

We  analyzed  the  effect  of  class-average  achievement  (CA- 
ACH)  on  students’  individual  achievement  (ACH)  and  aca¬ 
demic  self-concept  (ASC)  in  mathematics  over  time  using  four 
analytical  steps.  First,  we  investigated  the  compositional  effect 
on  achievement.  That  is,  we  analyzed  whether  the  average 
mathematics  achievement  of  classrooms  at  the  beginning  of 
Grade  7  (CA-ACH1)  had  an  effect  on  students’  individual 
mathematics  achievement  in  the  middle  of  Grade  7  (ACH2)  and 
at  the  end  of  Grade  7  (ACH3)  after  controlling  for  their  math¬ 
ematics  achievement  at  the  beginning  (ACH1).  In  the  second 
step,  we  analyzed  the  BFLPE  both  cross-sectionally  (as  done  in 
most  studies  on  the  BFLPE)  for  the  beginning  of  Grade  7  and 
longitudinally  in  the  middle  and  at  the  end  of  Grade  7;  that  is, 
the  effect  of  class-average  mathematics  achievement  at  the 
beginning  of  Grade  7  (CA-ACH1)  on  mathematics  self-concept 
at three different time points (ASC1, ASC2, ASC3). In the third
step,  we  estimated  cross-lagged  paths  between  students’  indi¬ 
vidual  mathematics  achievement  and  their  mathematics  self- 
concept  over  the  course  of  all  three  measurement  points.  Fi¬ 
nally,  we  brought  these  different  effects  together  by  modeling 
all  paths  simultaneously.  In  this  model,  we  then  also  estimated 
mediation  effects  for  class-average  mathematics  achievement  at 
T1  on  individual  mathematics  achievement  at  T3,  as  well  as  on 
mathematics  self-concept  at  T3  via  mathematics  self-concept  at 
T2  and  via  individual  mathematics  achievement  at  T2. 
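
As a reading aid, the first two steps can be sketched as two-level regressions with a random intercept for classes. The notation ($i$ for students, $j$ for classes, $\gamma$ and $\beta$ for coefficients, $u_{0j}$ and $e_{ij}$ for class-level and student-level residuals) is an assumption made for illustration, the control variables are omitted, and the equations are not reproduced from the original analysis.

```latex
% Step 1: compositional effect on achievement (T1 -> T2)
\mathrm{ACH2}_{ij} = \gamma_{00} + \beta_{1}\,\mathrm{ACH1}_{ij}
  + \gamma_{01}\,\text{CA-ACH1}_{j} + u_{0j} + e_{ij}

% Step 2: cross-sectional BFLPE at T1
\mathrm{ASC1}_{ij} = \gamma_{00}' + \beta_{1}'\,\mathrm{ACH1}_{ij}
  + \gamma_{01}'\,\text{CA-ACH1}_{j} + u_{0j}' + e_{ij}'
```

In this notation, a positive $\gamma_{01}$ corresponds to the compositional effect on achievement and a negative $\gamma_{01}'$ to the BFLPE.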

For  each  step — except  the  third  step,  which  focused  on 
effects  at  the  individual  level  only — we  specified  multilevel 
models in Mplus 7.11 (Muthén & Muthén, 1998–2010). The
mediation analyses were based on the approach suggested by
Pituch and Stapleton (2012), which is suited for multilevel
mediations in which the effect of a variable at level 2 on a
variable at level 1 is mediated by a variable at level 1 (2 → 1 →
1), as is the case in our study.4 Following this approach, we
multiplied  the  relevant  path  coefficients  using  the  constraint 
command in Mplus. We used confidence intervals (CIs) to
assess the statistical significance of the indirect effects
(CINTERVAL-Output5), as they provide greater statistical power
than conventional significance tests (Pituch & Stapleton,
2012).

To  account  for  differences  between  classes,  we  grand-mean 
centered (Enders & Tofighi, 2007) and z-standardized (M = 0,
SD = 1) all continuous variables at the individual level. For
ease  of  interpretation,  we  also  z-standardized  class-average 
mathematics  achievement.  In  all  models,  we  controlled  for 
federal  state,  school  track,  gender,  parents’  socioeconomic  sta¬ 
tus,  parents’  educational  background,  and  language  spoken  at 
home.  To  address  missing  data,  we  used  the  Full  Information 
Maximum  Likelihood  procedure,  which  is  implemented  in 
Mplus (FIML; Muthén & Muthén, 1998–2010). This procedure,
along  with  multiple  imputation,  is  considered  to  be  the  state- 
of-the-art  method  for  handling  missing  data  and  obtaining  un¬ 
biased  parameter  estimates  (Graham,  2012).  In  the  present 


study, between 17.9% and 34.8% of the data concerning the mathematics
achievement test and the self-report on mathematics self-concept were
missing at the three different measurement points. Most data were
missing at T3 because some schools could not be resampled. Regarding
students who dropped out at T2 or T3, comparative analyses with the
remaining students showed no significant differences in their
mathematics self-concept at T1. With respect to other variables, there
were only small differences in mathematics achievement (dT1T2 = .08;
dT1T3 = .12), gender (dT1T2 = .02; dT1T3 = .10), highest Treiman Index
(dT1T2 = .07; dT1T3 = .20), language spoken at home (dT1T2 = .10;
dT1T3 = .16), and school track (dT1T2 = .02; dT1T3 = .18).
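
The data preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study’s procedure: the function names are hypothetical, the sample standard deviation is used for standardization, the data are assumed to be complete (the analyses themselves handle missing values with FIML), and the `min_size` default mirrors the exclusion of classes with fewer than 10 students reported in the Sample section.

```python
from collections import defaultdict
from statistics import fmean, stdev


def z_standardize(values):
    """Grand-mean center and scale to unit sample standard deviation."""
    mean, sd = fmean(values), stdev(values)
    return [(v - mean) / sd for v in values]


def class_averages(scores, class_ids, min_size=10):
    """Average score per class, keeping only classes with at least min_size students."""
    by_class = defaultdict(list)
    for score, class_id in zip(scores, class_ids):
        by_class[class_id].append(score)
    return {c: fmean(v) for c, v in by_class.items() if len(v) >= min_size}


# Hypothetical toy data; min_size is lowered so that the small example keeps a class.
toy_scores = [1, 3, 5]
toy_classes = ["a", "a", "b"]
toy_z = z_standardize(toy_scores)
toy_means = class_averages(toy_scores, toy_classes, min_size=2)  # class "b" is dropped
```

Standardizing the individual scores and the class averages separately is what allows the coefficients in the tables to be read on a common scale.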

Results 

Compositional  Effect  on  Achievement:  The  Effect  of 
Class-Average Mathematics Achievement on Students’
Individual  Mathematics  Achievement 

First,  we  modeled  the  compositional  effect  on  achievement, 
that  is,  the  influence  of  class-average  math  achievement  on 
students’  individual  math  achievement  after  controlling  for 
previous  individual  math  achievement  (see  Table  1).  Model  1 
estimates the compositional effect from T1 to T2, whereas
Model 2 estimates the compositional effect from T1 to T3.
Model 3 combines Models 1 and 2, that is, Model 3 estimates
the effect of class-average math achievement on individual
math achievement at T2 and T3 simultaneously while controlling
for individual math achievement at both T1 and T2. In line
with previous findings, class-average math achievement had a
positive effect on students’ individual math achievement at both
T2 (β = .16, p < .001; Model 1) and T3 (β = .13, p < .001;
Model 2). Model 3 reveals that, even when controlling for
individual math achievement at T2 and modeling both compositional
effects at T2 and T3 simultaneously, both effects remained
similar in size (for T2: β = .15, p < .001; for T3: β = .13, p <
.01).

Big-Fish-Little-Pond  Effect:  The  Effect  of  Class- 
Average  Mathematics  Achievement  on  Students’ 
Mathematics  Self-Concept 

In the second step, we analyzed the BFLPE, that is, the effect
of  class-average  math  achievement  on  students’  math  self- 
concept  after  controlling  for  students’  math  achievement  (see 
Table  2).  In  Model  4,  we  modeled  the  BFLPE  in  a  cross- 
sectional  framework  by  regressing  students’  math  self-concept 


4  The  cross-level  mediation  approach  by  Pituch  and  Stapleton  (2012) 
differs  from  the  multilevel  structural  equation  modeling  (MSEM)  approach 
suggested  by  Preacher,  Zhang,  and  Zyphur  (2011).  Whereas  the  MSEM 
approach  focuses  on  mediation  at  the  between  level,  the  approach  by  Pituch 
and  Stapleton  (2012)  is  suitable  for  analyzing  cross-level  effects  mediated 
by  a  variable  at  the  individual  level,  which  is  the  case  in  the  present  study. 

5 Pituch and Stapleton (2012) recommend the PRODCLIN program for
R. The CINTERVAL output command offers an equivalent function for
Mplus (source: http://www.statmodel.com/discussion/messages/11/1281.html?1344653748).
Table 1

The Effect of Class-Average Mathematics Achievement on Students’ Individual Mathematics
Achievement Over the Course of Grade 7

                                  Model 1        Model 2        Model 3        Model 3
Outcome                           ACH2           ACH3           ACH2           ACH3
Variable                          β (SE)         β (SE)         β (SE)         β (SE)

Level 1: Individual predictors
  ACH1                            .37*** (.02)   .32*** (.02)   .37*** (.02)   .17*** (.02)
  ACH2                                                                         .37*** (.02)
Level 2: Class-level predictors
  CA-ACH1                         .16*** (.03)   .13** (.04)    .15*** (.03)   .13** (.05)
Residual variance
  Level 1                         .35            .38            .35            .33
  Level 2                         .10            .14            .10            .15

Note. Control variables are gender, parents’ socioeconomic status, parents’ educational background, and
language spoken at home at Level 1, and school track and federal state at Level 2. ACH1 = students’ individual
mathematics achievement at the beginning of Grade 7; ACH2 = students’ individual mathematics achievement
in the middle of Grade 7; ACH3 = students’ individual mathematics achievement at the end of Grade 7;
CA-ACH1 = classroom average mathematics achievement at the beginning of Grade 7.

* p < .05. ** p < .01. *** p < .001.


at  T1  on  class-average  math  achievement  and  individual  math 
achievement at T1. We then also investigated the BFLPE in a
longitudinal framework. Models 5a and 5b focus on the influence
of class-average math achievement at T1 on math self-concept
at T2, with Model 5b additionally controlling for math
self-concept at T1. Models 6a and 6b specify the effects of
class-average math achievement at T1 on math self-concept at
T3, with Model 6b additionally controlling for math
self-concept at T1.

In line with previous findings, the BFLPE was evident when
analyzed in a cross-sectional framework (Model 4). Class-average
math achievement had a negative effect of β = −.12
(p < .001) on students’ math self-concept after controlling for
individual math achievement. This negative effect was also
discernible, although somewhat smaller, from T1 to T2
(β = −.08, p < .01; Model 5a), and from T1 to T3 (β = −.10,
p < .01; Model 6a). However, when we additionally controlled
for previous math self-concept at T1, the effect was no longer
statistically significant (β = −.03, p = .24; Model 5b;
β = −.06, p = .10; Model 6b).

Reciprocal  Effects  Model:  Cross-Lagged  Associations 
Between  Mathematics  Achievement  and  Mathematics 
Self-Concept 

In  the  third  step,  we  specified  the  reciprocal  relations  between 
students’  math  achievement  and  their  math  self-concept  at  the 
individual level. Whereas Model 7 presents an analysis of the


Table 2

The Effect of Class-Average Mathematics Achievement on Students’ Mathematics Self-Concept
Over the Course of Grade 7

                                  Model 4        Model 5a       Model 5b       Model 6a       Model 6b
Outcome                           ASC1           ASC2           ASC2           ASC3           ASC3
Variable                          β (SE)         β (SE)         β (SE)         β (SE)         β (SE)

Level 1: Individual predictors
  ASC1                                                          .54*** (.02)                  .43*** (.02)
  ACH1                            .30*** (.02)   .23*** (.02)   .08*** (.02)   .23*** (.02)   .12*** (.02)
Level 2: Class-level predictors
  CA-ACH1                         −.12*** (.03)  −.08** (.03)   −.03 (.03)     −.10** (.04)   −.06 (.03)
Residual variance
  Level 1                         .83            .87            .63            .87            .72
  Level 2                         .05            .04            .03            .06            .04

Note. Control variables are gender, parents’ socioeconomic status, parents’ educational background, and
language spoken at home at Level 1, and school track and federal state at Level 2. ASC1 = mathematics
self-concept at the beginning of Grade 7; ASC2 = mathematics self-concept in the middle of Grade 7; ASC3 =
mathematics self-concept at the end of Grade 7; ACH1 = students’ individual mathematics achievement at the
beginning of Grade 7; CA-ACH1 = classroom average mathematics achievement at the beginning of Grade 7.
* p < .05. ** p < .01. *** p < .001.
cross-lagged paths between students’ math achievement and their
math self-concept at T1 and T2, Model 8 analyzes the corresponding
paths at T1 and T3, as does Model 9 at T2 and T3. Finally,
Model 10 analyzes all paths simultaneously.

As Table 3 shows, the four estimated models replicate previous
findings showing that students’ math achievement and students’
math self-concept mutually reinforce each other. The path
representing self-enhancement, that is, the path from math self-concept
to math achievement, was β = .11 (p < .001) from T1 to T2
(Model 7), β = .09 (p < .001) from T1 to T3 (Model 8), and β =
.05 (p < .001) from T2 to T3 (Model 9). When analyzing all paths
simultaneously (Model 10), the coefficients dropped in size to β =
.03 (p < .05) from T1 to T3 and β = .03 (p = .08) from T2 to T3.
The path representing skill development, that is, the path from
math achievement to math self-concept, was β = .08 (p < .001)
from T1 to T2 (Model 7), β = .10 (p < .001) from T1 to T3
(Model 8), and β = .11 (p < .001) from T2 to T3 (Model 9). The
latter two paths also dropped in size in Model 10 to β = .05 (p <
.01) and β = .05 (p < .05). Despite small variations, the
self-enhancement and the skill development paths were generally
comparable in size, thus providing evidence for the reciprocal effects
model.

The  Effect  of  Class-Average  Mathematics  Achievement 
on  Students’  Mathematics  Achievement  and 
Mathematics  Self-Concept  over  Time 

In  the  fourth  step,  we  explicitly  addressed  the  interplay 
between  the  compositional  effect  on  achievement  and  the 
BFLPE  over  time  by  estimating  the  previous  models  simulta¬ 
neously  over  the  time  period  of  one  school  year.  In  doing  so,  we 
modeled  the  BFLPE  in  a  longitudinal  framework.  In  Model  11, 
we  estimated  the  effect  of  class-average  math  achievement  on 
students’  math  achievement  and  math  self-concept  simultane¬ 
ously  at  T2,  controlling  for  their  math  achievement  and  math 
self-concept at T1. In Model 12, we specified the same paths for
the time period of T1 to T3. Model 13 includes the paths for all
three  measurement  points. 


As Table 4 shows, Models 11 to 13 reveal the same patterns
of results as the separate models specified in the previous
sections. Controlling for individual math achievement at T1, the
positive effect of class-average math achievement at T1 on
math achievement was present at T2 (β = .17, p < .001; Model
11) and at T3 (β = .14, p < .01; Model 12). Even when
controlling for individual math achievement at T2 while modeling
the compositional effect on achievement at T3 (Model
13), the positive effect of class-average math achievement
remained (β = .14, p < .01). Regarding the BFLPE, as in the
separate models, there was no statistically significant effect of
class-average math achievement at T1 on students’ math
self-concept at T2 or T3 after controlling for their math self-concept
at T1 (coefficients ranging from β = −.06 to β = −.03; see
Models 11, 12, and 13). The reciprocal relationships between
individual math achievement and math self-concept found in
the separate models were also found in the full model. That is,
the cross-lagged paths were similar in size, and there were only
marginal differences between the different time periods. Figure
5 also depicts all theoretically relevant path coefficients from
Model 13.

Taken  together,  the  findings  show  that  class-average  math 
achievement  has  a  continuous  positive  effect  on  students’  indi¬ 
vidual  math  achievement  throughout  Grade  7,  but  no  continu¬ 
ous  negative  effect  on  students’  math  self-concept  after  taking 
into  account  students’  math  achievement  and  math  self-concept 
at  the  beginning  of  Grade  7.  In  order  to  better  understand  the 
interplay  between  the  compositional  effect  on  achievement  and 
the  BFLPE,  we  further  investigated  mediational  effects  by 
analyzing  whether  the  effect  of  class-average  math  achievement 
on  individual  math  achievement  and  math  self-concept  at  the 
end  of  Grade  7  was  mediated  via  math  achievement  and  math 
self-concept in the middle of Grade 7. In line with the nonsignificant
BFLPE described above, we found that the effect of
class-average math achievement at T1 on math achievement at
T3 was not mediated via students’ math self-concept at T2
(indirect effect: β = −.00; p = .29; 95% CI [−.00, .00]). The
positive effect of class-average math achievement at T1 on


Table 3

Cross-Lagged Associations Between Students’ Achievement and Academic Self-Concept Over the Course of Grade 7

            Model 7        Model 7        Model 8        Model 8        Model 9        Model 9        Model 10       Model 10       Model 10       Model 10
Outcome     ASC2           ACH2           ASC3           ACH3           ASC3           ACH3           ASC2           ACH2           ASC3           ACH3
Variable    β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)

Level 1: Individual predictors
  ASC1      .54*** (.02)   .11*** (.01)   .45*** (.02)   .09*** (.01)                                 .54*** (.02)   .11*** (.01)   .22*** (.02)   .03* (.01)
  ACH1      .08*** (.02)   .41*** (.02)   .10*** (.02)   .34*** (.02)                                 .08*** (.02)   .41*** (.02)   .05** (.02)    .16*** (.02)
  ASC2                                                                  .53*** (.02)   .05*** (.01)                                 .41*** (.02)   .03 (.02)
  ACH2                                                                  .11*** (.02)   .53*** (.02)                                 .05* (.02)     .44*** (.03)
Level 1 residual variance
            .65            .43            .76            .51            .69            .45            .65            .43            .66            .43

Note. Control variables are gender, parents’ socioeconomic status, parents’ educational background, and language spoken at home at Level 1. ASC1 =
mathematics self-concept at the beginning of Grade 7; ASC2 = mathematics self-concept in the middle of Grade 7; ASC3 = mathematics self-concept at
the end of Grade 7; ACH1 = students’ individual mathematics achievement at the beginning of Grade 7; ACH2 = students’ individual mathematics
achievement in the middle of Grade 7; ACH3 = students’ individual mathematics achievement at the end of Grade 7.

* p < .05. ** p < .01. *** p < .001.
Table 4

The Effect of Class-Average Achievement on Students’ Individual Achievement and Academic Self-Concept Over the Course of Grade 7

            Model 11       Model 11       Model 12       Model 12       Model 13       Model 13       Model 13       Model 13
Outcome     ASC2           ACH2           ASC3           ACH3           ASC2           ACH2           ASC3           ACH3
Variable    β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)         β (SE)

Level 1: Individual predictors
  ASC1      .54*** (.02)   .12*** (.01)   .43*** (.02)   .10*** (.01)   .54*** (.02)   .12*** (.01)   .21*** (.02)   .04** (.01)
  ACH1      .08*** (.02)   .34*** (.02)   .11*** (.02)   .29*** (.02)   .08*** (.02)   .34*** (.02)   .06** (.02)    .16*** (.02)
  ASC2                                                                                                .39*** (.02)   .04** (.02)
  ACH2                                                                                                .08** (.03)    .35*** (.02)
Level 2: Class-level predictors
  CA-ACH1   −.03 (.03)     .17*** (.03)   −.06 (.03)     .14** (.04)    −.03 (.02)     .16*** (.03)   −.05 (.03)     .14** (.04)
Residual variance
  Level 1   .63            .34            .72            .37            .63            .34            .62            .33
  Level 2   .03            .10            .04            .14            .03            .10            .05            .15

Note. Control variables are gender, parents’ socioeconomic status, parents’ educational background, and language spoken at home at Level 1, and school
track and federal state at Level 2. ASC1 = mathematics self-concept at the beginning of Grade 7; ASC2 = mathematics self-concept in the middle of Grade
7; ASC3 = mathematics self-concept at the end of Grade 7; ACH1 = students’ individual mathematics achievement at the beginning of Grade 7; ACH2 =
students’ individual mathematics achievement in the middle of Grade 7; ACH3 = students’ individual mathematics achievement at the end of Grade 7;
CA-ACH1 = classroom average mathematics achievement at the beginning of Grade 7.

* p < .05. ** p < .01. *** p < .001.


students’ individual math achievement at T3, that is, the
compositional effect on achievement, was partially mediated by
individual math achievement at T2 (indirect effect: β = .06,
p < .001; 95% CI [.03, .08]). Similarly, we found a statistically
significant, albeit small, indirect effect of class-average math
achievement at T1 on students’ math self-concept at T3 via
individual math achievement at T2 (β = .01, p < .01; 95% CI
[.00, .02]), but not via math self-concept at T2 (β = −.01, p =
.23; 95% CI [−.03, .01]).6 The findings from the mediational
analyses thus confirm that the higher the average math achievement
of a class at the beginning of Grade 7, the greater the
achievement gains of individual students throughout the school
year. In contrast, class-average math achievement at T1 did not
affect students’ math self-concept over the course of Grade 7. In
fact, class-average math achievement even had a small positive
buffering effect on math self-concept at T3 via students’
individual math achievement at T2.
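
The indirect effects above are products of two path coefficients, so their size can be checked against the Model 13 columns of Table 4. The sketch below multiplies those coefficients and, as a stand-in for the interval method used in the analyses, draws a simple Monte Carlo interval that treats the two estimates as independent normal variables. The function names, the independence assumption, the seed, and the number of draws are assumptions made for illustration.

```python
import random


def indirect_effect(a, b):
    """Product-of-coefficients estimate of an indirect effect."""
    return a * b


def monte_carlo_ci(a, se_a, b, se_b, draws=20000, level=0.95, seed=1):
    """Percentile interval for a*b, assuming independent normal estimates (a simplification)."""
    rng = random.Random(seed)
    products = sorted(rng.gauss(a, se_a) * rng.gauss(b, se_b) for _ in range(draws))
    lower_index = int((1 - level) / 2 * draws)
    upper_index = int((1 + level) / 2 * draws) - 1
    return products[lower_index], products[upper_index]


# Table 4, Model 13: CA-ACH1 -> ACH2 (.16, SE .03) and ACH2 -> ACH3 (.35, SE .02).
via_achievement = indirect_effect(0.16, 0.35)  # 0.16 * 0.35, about .06
ci_low, ci_high = monte_carlo_ci(0.16, 0.03, 0.35, 0.02)

# Table 4, Model 13: CA-ACH1 -> ASC2 (-.03) and ASC2 -> ASC3 (.39).
via_self_concept = indirect_effect(-0.03, 0.39)  # about -.01
```

The products agree with the reported indirect effects after rounding; the simulated interval is only a rough plausibility check and need not match the reported confidence intervals exactly.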

Additional  Analyses:  Modeling  the 
Cross-Sectional  BFLPE 

As described in the previous section, we modeled the BFLPE in a
longitudinal framework in the full model so that the effects of
class-average math achievement on individual math achievement
and on math self-concept could be estimated simultaneously and
compared. We did not find a statistically significant
negative effect of class-average math achievement on students’
math self-concept when we controlled for students’ achievement
throughout Grade 7. However, in the separate models
presented above, we did find a statistically significant BFLPE
when modeling it cross-sectionally at the beginning of Grade 7,
indicating that the negative effect of class-average math
achievement on students’ math self-concept happened earlier in
the school year. Therefore, it is possible that we underestimated
the BFLPE when analyzing the compositional effect on
achievement and the BFLPE simultaneously in a longitudinal
framework. In order to test how the BFLPE observed at the
beginning of the school year affected students’ academic development
throughout Grade 7, we specified two additional
mediational models in which we analyzed whether students’
math self-concept at T1 mediated the effect of class-average
math achievement at T1 on students’ math achievement at T3
and their math self-concept at T3.

In fact, we found a small negative indirect effect of β = -.01
(p < .01; 95% CI [-.02, -.01]) from class-average math
achievement at T1 on students’ math achievement at T3 via
their math self-concept at T1. Additionally, we found students’
math self-concept at T3 to be negatively affected by class-average
math achievement at T1 via their math self-concept at
T1 (indirect effect: β = -.01, p < .01; 95% CI [-.02, -.01]).
Therefore,  we  did  find  the  BFLPE  at  the  beginning  of  Grade  7 
to  have  a  negative  effect  on  students’  math  achievement  and 
their  math  self-concept  at  the  end  of  the  school  year,  which  we 
were  not  able  to  see  in  the  full  model  when  we  modeled  the 
BFLPE longitudinally. However, the indirect effects of class-average
math achievement on students’ academic development
via math self-concept, which were evident even when modeling
the BFLPE cross-sectionally, were not as large as the
indirect effects via math achievement described in the previous
section.

Additional  Analyses:  Findings  for  Other  Subjects 

In order to investigate whether our findings could be replicated
for other school subjects, we also analyzed the full model
for  the  subjects  of  biology,  physics,  and  English  as  a  foreign 


6 The mediational effect of class-average math achievement at T1 on
math self-concept at T3 via individual math achievement at T2 may be
interpreted as an inconsistent mediation (MacKinnon, 2008) as the direct
effect (β = -.05) was nonsignificant.
Figure 5. Path model of the relationship between the compositional effect on achievement, the big-fish-little-pond
effect, and the reciprocal effects model (Model 13). Only theoretically relevant paths are shown here.
Analyses additionally include the estimation of cross-sectional paths between individual achievement and
academic self-concept as well as the cross-lagged and auto-correlational paths between Time 1 and Time 3.
ASC1 = mathematics self-concept at the beginning of Grade 7; ASC2 = mathematics self-concept in the middle
of Grade 7; ASC3 = mathematics self-concept at the end of Grade 7; ACH1 = students’ individual mathematics
achievement at the beginning of Grade 7; ACH2 = students’ individual mathematics achievement in the middle
of Grade 7; ACH3 = students’ individual mathematics achievement at the end of Grade 7; CA-ACH1 =
classroom average mathematics achievement at the beginning of Grade 7; L1 = Level 1 (individual level); L2 =
Level 2 (class level). ** p < .01. *** p < .001.


language, all of which were also assessed in the study.7 In this
section, we summarize the findings from these models with
regard to the compositional effect on achievement, the BFLPE,
and corresponding mediational effects. In biology, the influence
of class-average achievement on individual achievement was
β = .17 (p < .001) from T1 to T2 and β = .11 (p < .01) from
T1 to T3, and the BFLPE no longer appeared to be significant
in the course of the school year (T1 to T2: β = .02, p = .54; T1
to T3: β = .00, p = .92). The positive effect of class-average
achievement on individual achievement at T3 was mediated via
individual achievement at T2 (indirect effect: β = .07, p <
.001; 95% CI [.03, .10]), and the indirect path via academic
self-concept at T2 was not significant (β = -.00, p = .55; 95%
CI [-.00, .00]). Concerning the negative effect of class-average
achievement at T1 on academic self-concept at T3, the indirect
path coefficients via academic self-concept at T2 (β = .01, p =
.35; 95% CI [-.01, .03]) and via individual achievement at T2
(β = .01, p = .05; 95% CI [.00, .02]) were not significant.

In physics, the positive effect of class-average achievement on
individual achievement was β = .16 (p < .001) from T1 to T2 and
β = .21 (p < .001) from T1 to T3. Class-average achievement had
no statistically significant effect on academic self-concept at T2
(T1 to T2: β = .01, p = .85) or at T3 (T1 to T3: β = .01, p = .66).
The positive effect on individual achievement at the end of Grade
7 was mediated by individual achievement at T2 (indirect effect:
β = .05, p < .001; 95% CI [.03, .07]) but not by academic
self-concept at T2 (indirect effect: β = .00, p = .84; 95% CI
[-.00, .00]). For the prediction of academic self-concept at T3,
there was a small mediation via individual achievement at T2
(indirect effect: β = .01, p < .01; 95% CI [.00, .02]), but not via
academic self-concept at T2 (indirect effect: β = -.00, p = .85;
95% CI [-.02, .02]).

In English as a foreign language, we found the same pattern.
Class-average achievement had a positive effect on individual
achievement at T2 (β = .44, p < .001) and at T3 (β = .43, p <
.001), but no effect on academic self-concept at T2 (β = -.01,
p = .75) or at T3 (β = -.03, p = .40). The positive effect on
achievement at T3 was also mediated by individual achievement at
T2 (indirect effect: β = .14, p < .001; 95% CI [.11, .18]), whereas
the indirect path via academic self-concept at T2 was not significant
(β = -.00, p = .13; 95% CI [-.00, .00]). There were no
significant indirect paths from class-average achievement at T1 on
academic self-concept at T3 via individual achievement at T2 (β =


7  The  academic  achievement  in  the  subjects  of  biology,  physics,  and 
English  as  a  foreign  language  was  measured  by  standardized  tests.  Each 
test  in  biology  was  taken  from  the  Second  International  Science  Study  IEA 
(SISS; Rosier & Keeves, 1991), the Lernerfolgstest JT 8 (Zentrum für
Schulversuche und Schulentwicklung des Bundesministeriums für Unterricht,
Kunst und Sport, 1975), the National Assessment of Educational
Progress (NAEP; National Center for Education Statistics, 1989), and the
IEA Six-Subject Survey (Walker, 1976). The tests in physics at the three
measurement points used items from SISS (Rosier & Keeves, 1991), the
Lernerfolgstest JT 8 (Zentrum für Schulversuche und Schulentwicklung
des Bundesministeriums für Unterricht, Kunst und Sport, 1975), and from
the  NAEP  (National  Center  for  Education  Statistics,  1989).  The  tests  in 
English  as  a  foreign  language  at  the  three  measurement  points  comprised 
items  taken  from  the  IEA  Six-Subject  Survey  (Walker,  1976),  the  MPI 
Schulleistungsstudie  (Baumert  et  al.,  1986;  Edelstein,  1970),  and  Schrand, 
Mulch,  Portmann,  and  Stark  (1974).  The  instruments  measuring  academic 
self-concept  were  the  same  as  those  used  for  mathematics;  only  the  terms 
for  the  subject  were  changed. 
.01, p = .13; 95% CI [-.01, .04]) or academic self-concept at T2
(β = -.01, p = .13; 95% CI [-.04, .03]). Our findings for the
three subjects indicate that, similar to the findings for mathematics,
the positive effect of class-average achievement on students’ individual
achievement was stronger than its negative effect on their
academic self-concept.

Discussion 

The  aim  of  the  present  study  was  to  investigate  how  class- 
average  achievement  affects  students’  individual  achievement  and 
their academic self-concept, and how that, in turn, affects subsequent
academic self-concept and achievement. We used data collected
from students at three measurement points over the course
of  Grade  7.  We  focused  on  mathematics  but  replicated  the  findings 
for  other  subjects  (biology,  physics,  English  as  a  foreign  language) 
in  additional  analyses. 

The  analyses  revealed  that  class-average  achievement  at  the 
beginning  of  Grade  7  had  a  positive  effect  on  students’  individual 
achievement  in  the  middle  and  at  the  end  of  the  same  school  year, 
controlling  for  students’  achievement  at  the  beginning  of  Grade  7, 
thus  replicating  previous  research  on  the  compositional  effect  on 
achievement  (e.g.,  Hanushek  et  al.,  2003;  Marks,  2010).  With 
respect  to  the  negative  effect  of  class-average  achievement  on 
students’ academic self-concept after controlling for students’ individual
achievement (the BFLPE), we only found an effect at
the  beginning  of  Grade  7;  this  effect  was  not  present  when  we  used 
a  longitudinal  framework  to  predict  academic  self-concept  in  the 
middle  and  at  the  end  of  Grade  7  after  additionally  controlling  for 
academic  self-concept  at  the  beginning  of  Grade  7.  This  finding  is 
also in line with previous research, which has found strong evidence
for the BFLPE in cross-sectional studies but only limited
evidence in longitudinal studies (Marsh et al., 2001, 2007). Moreover,
we found that students’ achievement and academic self-concept
were reciprocally related over the course of the school
year,  which  confirms  previous  findings  on  the  REM  (Marsh  & 
O’Mara,  2008;  Seaton  et  al.,  2014).  When  we  addressed  our 
research question by analyzing all effects described above simultaneously,
we found that the positive effect of class-average
achievement  on  students’  achievement  was  much  stronger  than  the 
negative  effect  of  class-average  achievement  on  academic  self- 
concept  over  the  course  of  the  school  year.  In  fact,  although  the 
compositional  effect  on  achievement  was  present  throughout  the 
entire school year, the BFLPE was only observable at the beginning
of the school year. Furthermore, mediation analyses revealed
that the effects of class-average achievement on students’ achievement
and academic self-concept at T3 were mediated by achievement
at T2, but not by academic self-concept at T2. That is, the
decline in academic self-concept in response to class-average
achievement did not result in lower achievement or lower academic
self-concept. Taken together, the compositional effect on
achievement  played  a  larger  role  for  students’  development  in  our 
study  than  did  the  BFLPE. 

Limitations  of  the  Present  Study 

When  interpreting  the  findings  from  this  study,  some  limitations 
must  be  addressed.  First,  our  study  was  conducted  in  the  German 
school system. Even though the German system is particularly well
suited for investigating the effects of average achievement because
of its tracked system, it would be interesting to investigate whether
and how this interplay is apparent in other school systems, particularly
those that are less rigidly tracked. In less selective school
systems,  for  example,  certain  effects  may  not  be  as  strong. 

Second,  the  present  study  covered  a  time  period  of  one  school 
year.  This  represents  only  a  short  period  in  view  of  an  entire  school 
career,  and  we  cannot  draw  any  conclusions  about  the  interplay 
between  the  compositional  effect  on  achievement  and  the  BFLPE 
over  a  longer  period  of  time  or  at  a  different  point  in  a  student’s 
school  career. 

Third,  the  data  were  assessed  in  1991/1992.  To  our  knowledge, 
there  was  not  a  more  recent  dataset  that  met  all  the  necessary 
criteria  for  analyzing  our  research  questions.  Even  though  the 
compositional  effect  and  the  BFLPE  are  universal  context  effects 
that  should  not  have  changed  over  time,  a  replication  of  our 
findings  with  more  recent  data  would  be  worthwhile. 

Finally, the present study does not make causal claims of the kind
made in causal mediation analyses (e.g., Imai, Keele, Tingley, &
Yamamoto, 2011; Valeri & VanderWeele, 2013). We do not
presume to have identified the mechanisms underlying the observed
effects of class-average achievement on individual achievement
and academic self-concept, or to have controlled for all
potentially confounding factors. Other theoretically plausible variables
(e.g., instruction) that were not included in the present study
may also be relevant for these relationships. We interpret our
results  as  important  descriptive  analyses  of  the  relative  importance 
of  the  compositional  effect  on  achievement  and  the  BFLPE  but 
stress  the  importance  of  conducting  additional  research  to  provide 
a  more  specific  causal  explanation  (for  the  distinction  between 
description  vs.  causal  explanation,  see  Foster,  2010). 

Theoretical  Significance  of  Our  Findings  and 
Implications  for  Future  Research 

The  marginal  role  of  the  BFLPE  over  the  course  of  a  school  year 
might  be  due  to  the  fact  that,  after  a  phase  in  which  students 
compared  themselves  with  the  other  students  in  their  class  at  the 
beginning of the school year, resulting in changes in their academic
self-concept, the relative position of students in their classroom
stayed largely the same throughout the year. Hence, we
observed  no  further  adjustments  in  their  academic  self-concepts.  A 
similar  pattern  appeared  in  a  study  by  Wouters  et  al.  (2012),  who 
found that changing from a high-achieving track to a low-achieving
track leads to an increase in academic self-concept and
a decrease in achievement. Whereas there was a continuous negative
effect on academic achievement in the following years for
students  who  changed  to  a  low-achieving  track,  the  development 
of  these  students’  academic  self-concept  did  not  differ  from  the 
development  of  the  academic  self-concept  of  students  who  stayed 
in  the  same  learning  group. 

As  for  the  compositional  effect  on  achievement,  which  was 
present  throughout  the  whole  school  year,  being  surrounded  by 
high-achieving students may have provided a continuously stimulating
learning environment for students. More specifically, and
taking  into  account  the  mechanisms  that  have  been  proposed  in  the 
literature  to  explain  compositional  effects  (see  review  by  Dumont 
et  al.,  2013;  see  also  van  Ewijk  &  Sleegers,  2010),  students  may 
have  influenced  one  another  and  teachers  may  have  provided 
cognitively more demanding instruction for high-achieving groups
(e.g.,  Dreeben  &  Barr,  1988;  Harker  &  Tymms,  2004;  Harris, 
2010).  Further  mechanisms  can  be  assumed  in  line  with  social- 
cognitive  theory  (Bandura,  1986) — high-achieving  students  may 
be  role  models  for  their  classmates,  who  imitate  the  observed 
learning  behavior  (i.e.,  their  approaches  to  solving  mathematical 
problems),  or  who  are  influenced  implicitly  by  their  motivation  to 
learn  (i.e.,  goal  contagion;  Aarts,  Gollwitzer,  &  Hassin,  2004). 

Based  on  the  cautionary  note  on  causation  in  the  previous 
section,  we  can  only  make  assumptions  concerning  mechanisms 
underlying  the  observed  effects  of  class-average  achievement  on 
students’ individual achievement and their academic self-concept.
Therefore,  more  research  is  needed  in  that  respect.  Understanding 
the  underlying  mechanisms  is  not  only  theoretically  important,  but 
it  would  also  be  helpful  in  order  to  develop  interventions  for 
buffering  the  negative  effects  of  the  BFLPE  or  supporting  students 
in  terms  of  a  more  robust  and  realistic  academic  self-concept.  For 
instance,  recent  studies  have  shown  that,  for  elementary  schools, 
the  use  of  differentiated  instruction  strategies  moderates  the 
BFLPE on academic self-concept (Roy et al., 2015). Similarly,
earlier studies showed that when teachers use an individual frame
of reference, this has a positive effect on students’ academic
self-concept (Lüdtke, Köller, Marsh, & Trautwein, 2005) and on
their  achievement  (for  an  overview,  see  Mischo  &  Rheinberg, 
1995;  Rheinberg  &  Krug,  1999). 

With  respect  to  research  on  the  BFLPE,  our  study  shows  that 
one  might  come  to  different  conclusions  about  the  importance  of 
this  effect  for  students’  academic  development  when  analyzing  it 
in  a  longitudinal  framework.  The  BFLPE  may  also  be  seen  in  a 
different  light  when  taking  into  account  the  positive  effect  that 
class-average  achievement  has  on  individual  achievement.  In 
terms  of  the  metaphor  introduced  at  the  beginning  of  this  article, 
one  may  conclude  that,  despite  small  losses  in  academic  self- 
concept,  it  may  be  worthwhile  for  a  fish  to  swim  in  a  “big  pond” 
because  of  the  positive  effects  on  achievement.  Therefore,  we 
would  like  to  encourage  researchers  to  not  only  investigate  the 
negative effects of class-average achievement on students’ academic
self-concept, but to simultaneously consider other important
dimensions  of  students’  academic  development,  such  as  individual 
academic  achievement,  for  which  class-average  achievement  may 
have  positive  effects.  On  a  more  general  note,  our  study  also 
suggests  that  research  from  different  strands  should  be  integrated 
in  order  to  better  understand  students’  development  and  to  arrive  at 
more  substantiated  findings  for  practice  and  policy. 
Appendix A

Mathematics  Self-Concept  Items 

■ I would much prefer math if it weren’t so hard. (1 = strongly agree to 4 = strongly disagree)

■ Although I make a real effort, math seems to be harder for me than for my fellow students. (1 = strongly
agree to 4 = strongly disagree)

■ Nobody’s perfect, but I’m just not good at math. (1 = strongly agree to 4 = strongly disagree)

■ Some topics in math are just so hard that I know from the start I’ll never understand them. (1 = strongly
agree to 4 = strongly disagree)

■ Math just isn’t my thing. (1 = strongly agree to 4 = strongly disagree)


Appendix B

Intercorrelations  of  All  Variables  Considered  in  the  Analyses 
Note. Intercorrelations of class-average mathematics achievement at T1 and control variables are estimated at Level 2. Federal state dummy: MWP =
Mecklenburg-Western Pomerania, SA = Saxony-Anhalt; ASC1 = mathematics self-concept at the beginning of Grade 7; ASC2 = mathematics
self-concept in the middle of Grade 7; ASC3 = mathematics self-concept at the end of Grade 7; ACH1 = students’ individual mathematics achievement
at the beginning of Grade 7; ACH2 = students’ individual mathematics achievement in the middle of Grade 7; ACH3 = students’ individual mathematics
achievement at the end of Grade 7; CA-ACH1 = classroom average mathematics achievement at the beginning of Grade 7.
Homework and Achievement: Using Smartpen Technology to Find the Connection

There is a long history of research efforts aimed at understanding the relationship between homework
activity  and  academic  achievement.  While  some  self-report  inventories  involving  homework  activity 
have  been  useful  for  predicting  academic  performance,  self-reported  measures  may  be  limited  or  even 
problematic.  Here,  we  employ  a  novel  method  for  accurately  measuring  students’  homework  activity 
using  smartpen  technology.  Three  cohorts  of  engineering  students  in  an  undergraduate  statics  course  used 
smartpens  to  complete  their  homework  problems,  thus  producing  records  of  their  work  in  the  form  of 
timestamped  digitized  pen  strokes.  Consistent  with  the  time-on-task  hypothesis,  there  was  a  strong  and 
consistent  positive  correlation  between  course  grade  and  time  doing  homework  as  measured  by  smartpen 
technology (r = .44), but not between course grade and self-reported time doing homework (r = -.16).
Consistent with an updated version of the time-on-task hypothesis, there was a strong correlation between
measures of the quality of time spent on homework problems (such as the proportion of ink produced for
homework within 24 hr of the deadline) and course grade (r = -.32), and between writing activity (such
as the total number of pen strokes on homework) and course grade (r = .49). Overall, smartpen
technology  allowed  a  fine-grained  test  of  the  idea  that  productive  use  of  homework  time  is  related  to 
course  grade. 

Keywords:  homework,  time-on-task,  educational  technology,  data  mining 


Homework  is  defined  as  “tasks  assigned  to  students  by  school 
teachers  that  are  meant  to  be  carried  out  during  non-school  hours” 
(Cooper,  1989,  p.  7).  Homework  has  the  potential  to  improve 
academic  learning,  perhaps  by  extending  time  to  learn  beyond  the 
classroom  and  priming  active  cognitive  processing  for  learning 
(Cooper, 1989, 2001; Mayer, 2011). Assigning homework problems
to be solved by students outside of class time is a common
practice in college courses in engineering, mathematics, and science.
The goal of the present study is to determine how students’
problem-solving  activity  on  homework  is  related  to  their  course 
grade  in  introductory-level  engineering  courses. 

Smartpen  Technology 

Suppose  a  teacher  assigns  homework  problems  for  students  to 
work  on  each  week.  How  can  we  know  the  degree  to  which 
students engage with the homework assignment? We could ask
them to report how much time (or effort) they put into the homework
assignments, but self-reported measures can be problematic.
Instead,  imagine  that  a  teacher  could  assign  homework  problems  to 
students  and  be  able  to  monitor  the  student’s  homework  activity  at 
any  time  and  any  place,  even  outside  of  class.  In  short,  suppose  we 
had  a  way  to  know  when  a  student  was  working  on  a  homework 
assignment  and  we  were  able  to  record  every  pen  stroke  a  student 
made while working on a handwritten assignment. The fine-grained
data mining of students’ handwritten homework activity
in the current study is enabled by newly developed
smartpen technology (Herold,
Stahovich, Lin, & Calfee, 2011).

Rationale 

Researchers  have  long  sought  to  understand  the  role  of  study 
activities (including homework activities) in academic achievement.
For example, Jones and Ruch (1928) examined the relationship
between the amount of time spent studying and first semester
grade point average. More recently, Credé and Kuncel (2008)
conducted a meta-analysis of 10 study habit, skill, and attitude
inventories and found that they had incremental validity in predicting
academic performance.

Much of this work relies on surveys and students’ self-reports of
study habits, which may limit its reliability. For example, Schuman,
Walsh, Olson, and Etheridge (1985) found little relation
between  study  time  and  grades,  and  attributed  this  to  “the  possible 
invalidity  of  student  reports  of  their  own  studying”  (p.  961). 
Blumner  and  Richards  (1997)  found  that  a  study  habit  inventory 
was useful for differentiating between high- and low-performing
students. However, the authors concluded that “It will be necessary
to directly observe students in the act of studying. Only in this
manner  can  it  be  determined  that  students  actually  do  what  they 
say  in  response  to  such  an  inventory”  (p.  132). 

In our present work, we take up this challenge and use Livescribe
Smartpens™ to measure students’ homework activity.
These devices have an integrated camera and are used with dot-patterned
paper. They serve the same function as a traditional ink
pen and also record the work as timestamped pen strokes. We
conducted studies in three offerings of a sophomore-level undergraduate
engineering course in statics. Students in these courses
completed their homework assignments using the smartpens, thus
producing records of the work in the form of timestamped digitized
pen strokes.
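
To make concrete how timestamped pen strokes can be turned into homework measures, the following Python sketch computes three quantities of the kind named in the abstract: time spent writing, the proportion of ink produced within 24 hr of the deadline, and the number of pen strokes. The `Stroke` record, the function names, and the 60-s pause threshold used to decide whether a gap between strokes still counts as time on task are hypothetical choices made for illustration; they are not taken from the smartpen software or from the analysis procedure of the present study.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    start: float  # stroke start time, in seconds (hypothetical format)
    end: float    # stroke end time, in seconds
    ink: float    # ink length of the stroke, arbitrary units


def active_time(strokes, max_gap=60.0):
    """Writing time plus pauses no longer than max_gap seconds."""
    ordered = sorted(strokes, key=lambda s: s.start)
    total = 0.0
    prev_end = None
    for s in ordered:
        total += s.end - s.start
        # Short pauses between strokes are treated as time on task;
        # longer gaps are treated as breaks and are not counted.
        if prev_end is not None and 0 < s.start - prev_end <= max_gap:
            total += s.start - prev_end
        prev_end = s.end
    return total


def late_ink_fraction(strokes, deadline, window=24 * 3600):
    """Share of all ink written in the last `window` seconds before the deadline."""
    total = sum(s.ink for s in strokes)
    if total == 0:
        return 0.0
    late = sum(s.ink for s in strokes
               if deadline - window <= s.start <= deadline)
    return late / total


if __name__ == "__main__":
    # Toy record: two strokes close together, one much later.
    strokes = [Stroke(0, 2, 10), Stroke(5, 6, 5), Stroke(1000, 1003, 15)]
    print(active_time(strokes))                             # 9.0
    print(late_ink_fraction(strokes, deadline=1010, window=100))  # 0.5
    print(len(strokes))                                     # 3 strokes
```

Measures of this kind, computed per student, can then be correlated with course grade.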

Homework 

There  is  encouraging  evidence — much  dating  from  the  1980s 
(Keith,  1982) — for  the  educational  value  of  homework  (Cooper, 
Robinson,  &  Patall,  2006;  Hattie,  2009;  Xu,  2013).  At  the  grossest 
level,  Hattie  (2009)  reported  an  average  effect  size  of  d  =  .29 
favoring  homework,  based  on  five  meta-analyses  involving  295 
experimental  tests  and  over  100,000  students.  In  another  review  of 
research  on  the  relation  between  homework  and  achievement, 
Cooper,  Robinson,  and  Patall  (2006)  found  a  weighted  average 
correlation of r = .24 based on 69 separate correlations. Importantly,
the research team found the positive correlation between
homework  and  achievement  was  greater  for  older  students  (e.g., 
high  school  students)  than  for  younger  students  (e.g.,  elementary 
school  students). 
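
Because the two reviews above report effects in different metrics (d and r), a rough conversion can help in comparing them. The Python sketch below uses the standard conversion between Cohen's d and the point-biserial correlation, which assumes two groups of equal size. That assumption and the function names are ours, and the comparison is only approximate because the correlations summarized by Cooper et al. relate continuous measures rather than two groups.

```python
import math


def d_to_r(d):
    """Convert Cohen's d to r, assuming two equal-sized groups."""
    return d / math.sqrt(d * d + 4)


def r_to_d(r):
    """Convert r to Cohen's d under the same equal-groups assumption."""
    return 2 * r / math.sqrt(1 - r * r)


if __name__ == "__main__":
    print(round(d_to_r(0.29), 2))  # 0.14
    print(round(r_to_d(0.24), 2))  # 0.49
```

Under this assumption, d = .29 corresponds to r of about .14, and r = .24 corresponds to d of about .49.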

Although  early  research  focused  on  the  quantity  of  homework 
activity  (such  as  the  reported  time  spent  on  homework),  Xu  (2013) 
has  proposed  that  the  next  step  in  research  on  homework  is  to  more 
carefully  examine  the  quality  of  homework  activity — including  the 
learner’s effort and activity. A methodological obstacle to determining
the relation between homework and achievement is that
much of the existing research is based on students’ self-reported
time (or effort) on homework rather than on their actual activity. A
related methodological obstacle is that the focus is on what homework
is assigned by teachers rather than on what is done by
students  as  they  work  on  their  homework. 

The  present  study  overcomes  these  challenges  by  employing  a 
computer-based  technology  for  tracking  the  details  of  students’ 
homework  activity  in  real  time  using  smartpens.  This  technology 
provides  a  level  of  detail  about  what  students  are  doing  and  when 
they are doing it that is not possible in classic research on homework.
Thus, this technology-enhanced system provides data for an
updated  examination  of  the  connection  between  homework  and 
achievement. 

Theory  and  Predictions 

The  amount  of  time  that  students  choose  to  give  to  a  task  can  be 
considered  a  measure  of  student  engagement  (Hattie,  2009;  van 
Gog,  2013).  Student  engagement  during  learning  is  at  the  heart  of 
theories  of  meaningful  learning  such  as  cognitive  load  theory 
(Sweller,  Ayres,  &  Kalyuga,  2011)  and  the  cognitive  theory  of 
multimedia learning (Mayer, 2009, 2014), and theories of academic
motivation such as self-efficacy theory (Schunk & Pajares,
2009) and attribution theory (Graham & Williams, 2009). Figure 1
shows the proposed causes and consequences of student engagement
during learning. In terms of what causes students to exert
effort, the left side of Figure 1 proposes that instructional features
(such as interactivity and personalization) and student characteristics
(such as self-efficacy and interest) can prime the level of
student effort during learning. A major task of research on instructional
design is to identify instructional features that cause the
learner to exert effort to learn, and a major task of research on
academic motivation is to identify motivational beliefs that cause
the learner to exert effort to learn. In terms of the
consequences of student engagement, the right side of Figure 1
shows that effort to learn can lead to better learning outcomes as
indicated by measures of achievement.

According to this basic model of academic learning, engagement
(as indicated by the amount of time that students allocate to
a  task)  is  a  mechanism  affecting  learning  outcomes  (as  indicated 
by  achievement).  Our  focus  in  the  current  study  is  on  the  relation 
between  time  on  a  study  task  (i.e.,  doing  homework  assignments) 
and  grades  in  a  college  course.  Thus,  our  focus  is  on  a  crucial  link 
in  a  model  of  academic  learning.  A  major  new  contribution  is  a 
more  detailed  measurement  of  student  engagement  on  a  study 
activity  (i.e.,  doing  handwritten  homework  assignments)  than  has 
been  previously  available. 

Our predictions are based on the time-on-task hypothesis (Hattie,
2009; van Gog, 2013), which holds that learning new material
is  related  to  the  amount  of  time  a  student  is  effortfully  engaged  in 
a  productive  learning  activity.  Productive  learning  activities  are 
those  that  cause  the  student  to  attend  to  relevant  material,  mentally 
organize  it,  and  relate  it  with  relevant  prior  knowledge  (Mayer, 
2009,  2014).  Spending  time  on  homework  is  one  way  to  increase 
productive  learning  time  beyond  the  school  day. 

Time-on-task — defined  as  the  amount  of  time  a  student  spends 
engaged  in  an  academic  task — can  be  “counted  among  the  most 
important  factors  affecting  student  learning  and  achievement”  (van 
Gog,  2013,  p.  432).  Rooted  in  Ebbinghaus’  (1885/1964)  classic 
studies  on  verbal  learning  which  showed  that  time  spent  studying 
a  word  list  is  related  to  the  amount  learned,  time-on-task  has  been 
recognized  as  a  potentially  important  variable  in  academic  learning 
since  the  1960s  (Berliner,  1991;  Carroll,  1963;  van  Gog,  2013).  In 
a  review  of  meta-analyses,  Hattie  (2009)  found  an  average  effect 
size  of  d  =  .38  for  time-on-task  based  on  four  meta-analyses 
examining  136  experimental  comparisons. 

Over  the  years,  the  concept  of  time-on-task  has  evolved  to 
reflect  a  focus  on  engaged  learning  time — time  in  which  the 
learner  is  actively  exerting  effort  on  a  task — rather  than  allocated 
learning  time — time  in  which  the  instructor  provides  opportunities 


Figure  1.  A  model  of  academic  learning. 
for learning (Berliner, 1984; Karweit, 1984). Within engaged
learning  time,  furthermore,  researchers  have  come  to  focus  on 
productive  learning  time — time  in  which  the  learner  is  exerting 
effort  to  learn  on  an  appropriate  academic  task  (Berliner,  1984). 
For  example,  van  Gog,  Ericsson,  Rikers,  and  Paas  (2005)  point  to 
the  role  of  deliberate  practice — spending  extended  periods  of  time 
effortfully  engaged  in  tasks  at  an  appropriate  level  of  challenge 
that  allow  for  continual  improvement.  Early  work  by  Anderson 
(1993)  provides  an  exemplary  demonstration  of  the  role  of  practice 
time  in  learning  with  computer-based  cognitive  tutors,  and  current 
work  continues  to  demonstrate  the  positive  impact  of  solving 
practice  problems  in  e-learning  (Clark  &  Mayer,  2011). 

Although  learning  mechanisms  were  not  highlighted  in  the  early 
conceptions  of  time-on-task,  the  updated  version  of  the  time-on- 
task  hypothesis  is  consistent  with  the  idea  that  meaningful  learning 
requires  active  cognitive  processing  in  working  memory  during 
learning  such  as  attending  to  relevant  information,  mentally  orga¬ 
nizing  it  into  a  coherent  structure,  and  relating  it  to  relevant  prior 
knowledge  activated  from  long-term  memory  (Mayer,  2011). 
What  turns  learning  time  into  productive  learning  time  is  that  the 
learner  is  engaged  in  appropriate  cognitive  processing  on  appro¬ 
priate  tasks  during  learning — processing  that  leads  to  constructing 
new  knowledge  and  skills. 

Based  on  these  revisions  in  the  classic  concept  of  time-on-task, 
we  expand  the  time-on-task  hypothesis  to  focus  also  on  the  quality 
of  time  spent  on  homework.  Overall,  we  examine  three  predictions 
about  the  relation  between  homework  and  achievement  concerning 
the  quantity  of  time  (i.e.,  how  much)  and  the  quality  of  time  (i.e., 
when). 

1.  How  much:  The  most  straightforward  prediction  of  the 
time-on-task  hypothesis  is  that  time  spent  solving  home¬ 
work  problems  is  related  to  course  grade.  However,  a 
problem  with  traditional  research  on  homework  is  that 
some  studies  use  self-reported  estimates  of  time  spent 
doing  homework.  An  important  improvement  in  the  cur¬ 
rent  technology-supported  study  is  that  we  have  access  to 
the  actual  time  that  students  were  working  on  their  home¬ 
work  problems,  including  when  they  started  and  ended 
each  session. 

2.  When:  In  addition  to  focusing  solely  on  time  spent  on 
homework,  a  more  sophisticated  approach  is  to  measure 
the  quality  of  the  time,  such  as  the  degree  to  which  the 
homework  activity  was  performed  in  advance  of  the 
deadline  for  submission.  Although  traditional  research  on 
homework  generally  does  not  include  measures  of  when 
the  homework  was  done,  our  technology-supported  en¬ 
vironment  allows  us  to  test  the  prediction  that  doing 
homework  farther  in  advance  of  the  deadline  is  related  to 
course  grade. 

3.  How many:  In addition to focusing solely on time spent on homework, a more sophisticated approach is to measure how the time was spent. This is difficult in traditional research on homework, which does not involve in-process measures of homework activity. However, in our technology-supported environment, a straightforward way to measure the amount of effort put


into  doing  homework  is  to  count  the  number  of  strokes 
performed  in  solving  homework  problems.  This  allows 
us  to  test  a  more  focused  version  of  the  time-on-task 
hypothesis:  number  of  strokes  performed  while  solving 
homework  problems  is  related  to  course  grade. 

We  examine  these  three  predictions,  and  related  predictions, 
across  three  cohorts  of  engineering  students  enrolled  in  an  intro¬ 
ductory  course  in  statics. 

Related  Research  on  Data  Mining  in  Education 

Educational  data  mining  with  computer-based  instructional  sys¬ 
tems  has  a  rich  history  dating  back  to  large-scale  studies  of  computer- 
assisted  instruction  (CAI)  in  schools  in  the  1960s  (e.g.,  Atkinson, 
1968),  extensive  use  of  log  files  for  modeling  student  learning  with 
computer-based  cognitive  tutors  (Anderson,  1993),  and  the  subse¬ 
quent  use  of  log  files  with  intelligent  tutoring  systems  (Koedinger, 
D’Mello,  McLaughlin,  Pardos,  &  Rose,  2015).  In  recent  years,  re¬ 
searchers  have  made  significant  progress  in  educational  data  mining 
or  EDM  (Koedinger,  D’Mello,  McLaughlin,  Pardos,  &  Rose,  2015; 
Romero,  Romero,  Luna,  &  Ventura,  2010).  Much  of  the  data  used  in 
this  work  is  extracted  from  log  files  of  intelligent  tutoring  systems 
(Beal  &  Cohen,  2008;  Li,  Cohen,  Koedinger,  &  Matsuda,  2011; 
Mostow,  Gonzalez-Brenes,  &  Tan,  2011;  Shanabrook,  Cooper, 
Woolf,  &  Arroyo,  2010;  Stevens,  Johnson,  &  Soller,  2005;  Trivedi, 
Pardos, Sarkozy, & Heffernan, 2011) and learning management systems such as Moodle and Blackboard (Kruger, Merceron, & Wolf,
2010;  Romero,  Ventura,  Vasilyeva,  &  Pechenizkiy,  2010).  This  work 
relies  on  a  variety  of  data  mining  techniques  including  clustering 
(Antonenko,  Toy,  &  Niederhauser,  2012;  Stevens  et  al.,  2005;  Trivedi 
et  al.,  2011),  model  prediction  (Li  et  al.,  2011;  Mostow  et  al.,  2011; 
Stevens  et  al.,  2005),  and  sequence  analysis  (Beal  &  Cohen,  2008; 
Kruger et al., 2010; Romero, Romero, et al., 2010; Shanabrook et al.,
2010). 

Our  work  differs  from  this  in  that  we  record  and  mine  data  from 
learning  activities  involving  writing  on  paper,  rather  than  activities 
involving  typing  on  a  computer  keyboard.  The  work  of  Oviatt, 
Arthur,  and  Cohen  (2006)  suggests  that  natural  work  environments 
are  critical  to  student  performance.  In  their  examinations  of  com¬ 
puter  interfaces  for  completing  geometry  problems,  they  found  that 
“as  interfaces  departed  more  from  familiar  work  practice  .  .  .  , 
students  would  experience  greater  cognitive  load  such  that  perfor¬ 
mance  would  deteriorate  in  speed,  attentional  focus,  metacognitive 
control,  correctness  of  problem  solutions,  and  memory”  (p.  191). 
Similarly,  Anthony,  Yang,  and  Koedinger  (2008)  found  that  hand¬ 
writing  interfaces  were  more  beneficial  than  keyboard  interfaces 
for  math  tutoring  systems.  Mueller  and  Oppenheimer  (2014)  made 
a  similar  finding  in  relation  to  note-taking.  They  examined  student 
note-taking  using  both  longhand  and  laptops,  and  found  that  the 
latter  can  lead  to  shallower  processing.  Lectures  were  shown  on  a 
screen,  with  students  taking  notes,  followed  by  distractor  tasks. 
Using  a  model  including  both  word  count  and  verbatim  overlap 
(three-word  chunks  from  student  notes  matching  the  lecture  tran¬ 
script),  they  were  able  to  predict  performance  on  a  test  of  the 
lecture  material  with  a  correlation  coefficient  of  r  =  .41. 

Macfadyen  and  Dawson  (2010)  mined  data  from  a  learning 
management  system  (LMS)  to  predict  final  course  grade.  Their 
best  model  was  able  to  explain  33%  of  the  variance  in  grade 
utilizing three features: the number of mail messages sent, the number of assessments finished, and the total number of discussion messages posted. This provides some insight into the relationship between studying and course performance. However, the type of data available from an LMS (such as records of downloading course materials and submitting electronic assignments) does not provide a direct measurement of students’ homework activity. We use smartpens to capture a fine-grained record of students’ handwritten homework.

Researchers  have  used  video  recording  to  analyze  students’ 
problem-solving  activities  (Blanc,  1999;  Hall,  2000).  While  this 
approach  provides  a  detailed  record  of  student  work,  the  analysis 
is  time-consuming.  For  example,  Blanc  (1999)  made  75  recordings 
of  students  solving  mathematics  problems,  but  analyzed  only  two 
of  the  recordings.  This  sort  of  video  analysis  would  be  intractable 
in  our  studies,  which  involve  hundreds  of  students  completing 
homework  throughout  a  quarter-long  course.  For  our  studies, 
smartpens  provide  a  convenient  and  scalable  approach  for  captur¬ 
ing  high-resolution,  timestamped  records  of  problem-solving 
work. 

There  have  been  prior  studies  examining  learning  activities  in 
statics.  For  example,  work  by  Steif  and  Dollar  (2009)  examined 
usage  patterns  of  a  web-based  statics  tutoring  system  and  found 
that  learning  gains  increased  with  the  number  of  tutorial  elements 
completed.  Similarly,  work  by  Steif,  Lobue,  Kara,  and  Fay  (2010) 
examined  whether  students  can  be  induced  to  talk  about  the  bodies 
in  a  statics  problem,  and  if  doing  so  can  increase  a  student’s 
performance.  They  used  tablet  PCs  to  record  the  students’  spoken 
explanations  and  their  handwritten  solutions,  but  the  written  work 
was  left  mostly  unanalyzed. 

Researchers have only recently begun using smartpens for assessment. For example, Herold and Stahovich (2012) used smartpens to examine the homework of students who were asked to
provide  self-explanations  for  their  solutions  to  statics  problems. 
The  study  found  that  students  who  generated  self-explanations 
were  more  likely  to  complete  homework  problems  in  the  order 
assigned  (i.e.,  complete  one  problem  before  beginning  the  next) 
than  were  students  who  did  not  generate  self-explanations. 

Our  work  builds  on  that  of  Rawson  and  Stahovich  (2013)  who 
used  smartpens  as  part  of  a  technique  for  making  early  predictions 
of student success or failure in a statics course. They used smartpens to record students’ work on one homework assignment and a
corresponding  quiz  given  early  in  the  course.  They  computed  a 
number  of  features  from  this  digital  ink  data  including,  for  exam¬ 
ple,  the  total  time  spent  on  the  homework  and  the  amount  of  ink 
written.  By  themselves,  these  features  were  only  weakly  predictive 
of  a  student’s  course  performance.  However,  when  combined  with 
a  concept  inventory  score  (Steif  &  Dantzler,  2005),  these  features 
produced  useful  early  predictions. 

In  our  work,  we  employ  many  of  the  ink  features  they  devel¬ 
oped.  However,  our  goals  are  different.  While  their  goal  was  to  use 
data  collected  at  the  beginning  of  a  course  to  make  early  predic¬ 
tions  of  success  and  failure,  ours  is  to  understand  the  relationship 
between  homework  habits  and  course  performance.  Our  analysis 
considers  homework  behavior  over  the  entire  duration  of  a  course, 
while  they  considered  work  from  only  a  single  assignment  and 
quiz. 

Recently,  Van  Arsdale  and  Stahovich  (2012)  demonstrated  that 
the  spatial  and  temporal  organization  of  a  student’s  solution  to  an 


engineering problem is indicative of the correctness of that solution. They recorded students’ work on exam problems using smartpens and characterized the problem-solving activity in terms of the
sequence  of  problem-solving  steps  and  the  arrangement  of  the 
work  on  the  page.  While  they  focused  on  a  microscale  analysis  of 
problem-solving  behavior  on  individual  exam  problems,  we  con¬ 
sider  a  macroscale  analysis  of  homework  habits  over  the  duration 
of  a  course. 

Herold,  Stahovich,  and  Rawson  (2013)  used  smartpens  to  ex¬ 
amine  the  correlation  between  effort  on  a  homework  assignment 
and  grade  on  that  assignment.  They  characterized  effort  in  terms  of 
the  amount  of  time  spent  and  the  amount  of  ink  written.  They  also 
examined  transfer  from  homework  problems  to  subsequent  home¬ 
work,  quiz,  and  exam  problems.  They  characterized  problem¬ 
solving  work  by  the  amount  of  time  the  pen  was  in  contact  with  the 
paper,  which  is  only  a  fraction  of  the  time  spent  on  the  problem. 
They  found  that  this  “writing  time”  was  correlated  with  perfor¬ 
mance  on  subsequent  problems.  Our  work  is  similar  in  that  we  also 
examine  the  relationship  between  homework  activity  and  success. 
However,  we  consider  a  longer  time  scale  and  our  focus  is  under¬ 
standing  how  homework  habits  over  an  entire  course  relate  to 
success  in  that  course. 

Method 

Participants  and  Course  Setting 

The  participants  were  three  cohorts  of  undergraduate  engineer¬ 
ing  students  at  the  University  of  California,  Riverside  who  were 
enrolled  in  an  entry-level  course  in  statics — 92  students  in  the 
winter  quarter  of  2010  (Year  1),  109  students  in  the  winter  quarter 
of  2011  (Year  2),  and  127  students  in  the  winter  quarter  of  2012 
(Year  3).  The  winter  term  is  the  first  offering  of  the  statics  course 
for  the  academic  year.  The  majority  of  the  students  in  the  course 
are  from  mechanical  engineering,  although  students  from  several 
other  engineering  majors,  including  materials  science  and  environ¬ 
mental  engineering,  also  take  the  course.  Mechanical  engineering 
students  typically  take  the  course  in  the  sophomore  year.  The 
course  includes  two  80-min  lecture  periods  per  week.  Students  also 
attend  a  50-min  discussion  section  each  week.  The  course  employs 
a  traditional  lecture  format. 

Statics  is  the  part  of  engineering  mechanics  focused  on  the 
equilibrium  of  objects  subject  to  forces.  The  solution  to  a  statics 
problem  typically  includes  free  body  diagrams  and  equilibrium 
equations.  The  former  represent  the  forces  acting  on  a  system, 
while  the  latter  are  the  application  of  Newton’s  Second  Law. 
Figure  2  shows  a  typical  homework  problem  from  the  course  and 
Figure  3  shows  the  sort  of  solution  a  student  might  generate  for 
this  problem.  This  image  was  constructed  from  digitized  pen 
strokes  captured  with  a  smartpen. 

In  Year  1  students  used  Newton’s  Pen,  an  intelligent  tutoring 
system  for  statics  (Lee,  Stahovich,  &  Calfee,  2011).  This  system 
was  utilized  during  several  discussion  periods. 

In  Year  2  there  were  four  separate  discussion  sections,  each  of 
which  was  provided  with  one  of  three  different  experimental 
treatments.  Students  from  two  discussion  sections  were  asked  to 
provide  self-explanations  for  the  problem-solving  steps  for  six  of 
the  homework  assignments.  These  students  were  provided  with 
self-explanation  prompts  for  these  assignments.  Students  in  a  third 
[Figure 2 problem statement: The coefficient of static friction between block A and its incline is 0.25. What must the minimum coefficient of static friction between block B and its incline be, if the blocks are in equilibrium? Neglect friction in the pulley.]

Figure 2. A typical statics problem.


discussion  section  used  Newton’s  Pen  during  some  of  the  discus¬ 
sion  periods.  The  fourth  discussion  section  served  as  the  control. 
Students  in  this  section  did  not  provide  self-explanation,  nor  did 
they  use  Newton’s  Pen. 

In  Year  3  students  were  randomly  assigned  to  one  of  six  exper¬ 
imental  groups.  Four  of  the  groups  were  asked  to  provide  self¬ 
explanation  for  the  problem-solving  steps  on  their  homework. 
Each  of  these  four  groups  was  provided  with  varying  amounts  of 
scaffolding  for  self-explanation.  Students  in  a  fifth  group  used 
Newton’s  Pen  during  some  discussion  periods.  Students  in  the 
sixth  group  served  as  the  control.  For  the  final  homework  assign¬ 
ment,  all  students  were  prompted  to  provide  self-explanation  with¬ 
out  scaffolding.  Also,  in  some  discussion  periods,  students  were 
given  problems  to  solve.  They  began  the  problems  in  discussion, 
and  if  necessary  completed  them  later.  They  submitted  these 
solutions  with  their  homework. 

Course  grade  for  all  three  cohorts  was  based  on  the  following 
weighting:  10%  for  the  homework  score,  10%  for  the  quiz  score, 
10%  for  the  project  score,  20%  for  the  first  midterm  exam  score, 
20%  for  the  second  midterm  exam  score,  and  30%  for  the  final 
exam  score.  The  exams  and  quizzes  were  not  identical  across 
cohorts  but  the  content  and  format  were  similar.  For  example,  for 
all  3  years  the  first  midterm  included  one  problem  requiring 
students  to  compute  a  moment,  one  problem  involving  equilibrium 
analysis  of  a  two-dimensional  system,  and  one  problem  involving 
equilibrium  of  a  three-dimensional  system.  All  problems,  except 
an  ethics  problem  on  the  final  exam,  required  free-form  solutions, 
which  typically  required  one  or  more  free  body  diagrams  and 
equilibrium  equations.  Problems  were  graded  using  a  rubric  that 
examined  the  correctness  of  the  major  elements  of  the  solution.  For 
example,  an  equilibrium  problem  might  include  a  free  body  dia¬ 
gram,  geometric  calculations,  and  equilibrium  equations.  The 
Figure 3. A solution to the statics problem from Figure 2.


credit  for  the  problem  would  be  divided  over  these  elements 
according  to  their  complexity,  with  more  points  being  assigned  to 
the  more  challenging  elements.  If  an  element  was  missing,  the 
student  would  receive  no  credit  for  that  element.  Points  were 
deducted from each element for various types of errors such as sign
errors,  missing  terms  (e.g.,  a  missing  force  in  a  force  equilibrium 
equation),  incorrect  terms  (e.g.,  using  “sine”  instead  of  “cosine”), 
and  so  forth. 
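The weighting scheme described above can be illustrated with a minimal sketch. The dictionary keys and the 0-100 score scale are assumptions made for illustration; the text specifies only the percentages.

```python
# Component weights as stated in the text (they sum to 1.0).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "homework": 0.10,
    "quiz": 0.10,
    "project": 0.10,
    "midterm1": 0.20,
    "midterm2": 0.20,
    "final": 0.30,
}


def course_grade(scores: dict) -> float:
    """Weighted course grade from component scores (assumed 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Under this scheme, a perfect homework score with zeros elsewhere contributes 10 points to the course grade, which shows how small the direct contribution of homework to the grade is.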

Procedure 

Beginning in the third week of the course, students used smartpens to complete all homework assignments, quizzes, and exams.
Students  were  instructed  to  use  their  smartpen  instead  of  a  pencil. 
We  did  not  collect  data  from  the  first  two  homework  assignments 
and  quizzes.  In  Years  1  and  2,  there  were  a  total  of  nine  homework 
assignments,  and  we  collected  data  from  the  last  seven.  In  Year  3, 
there  was  a  total  of  eight  assignments  and  we  collected  data  from 
the  last  six.  In  all  years  we  collected  data  from  five  quizzes  (all 
quizzes  except  the  first  two),  two  midterm  exams,  and  one  com¬ 
prehensive  final  exam.  In  Year  1,  the  seven  homework  assign¬ 
ments  comprised  a  total  of  41  problems,  in  Year  2  there  were  44 
problems,  and  in  Year  3  there  were  40.  The  instructor  was  aware 
of  the  general  goal  of  the  study — to  capture  student  problem¬ 
solving  data  from  the  homework  that  could  be  related  to  course 
performance — but  the  data  were  not  analyzed  until  after  each 
cohort  completed  the  course  and  received  their  final  grades,  thus 
eliminating  the  possibility  of  bias  in  assigning  grades. 

Livescribe  Smartpens  create  two  records:  ink  on  paper  and 
timestamped  digitized  pen  strokes.  In  Year  1,  students  submitted 
both  the  paper  copy  of  each  assignment  and  their  smartpens.  We 
extracted  the  data  from  the  smartpens  and  returned  them  to  the 
students  so  they  could  complete  their  next  assignment.  For  Year  2, 
we  developed  software  to  enable  students  to  submit  their  assign¬ 
ments  electronically.  To  do  this,  a  student  docked  the  smartpen  to 
a  PC  using  a  USB  cable.  Our  software  then  extracted  the  ink  data 
and  submitted  it  to  a  server  for  grading.  We  graded  the  homework 
electronically  and  returned  it  as  a  PDF.  In  Year  2,  electronic 
submission  was  optional.  Students  could  still  submit  the  paper 
copy  of  an  assignment,  in  which  case  we  extracted  the  ink  data 
from the smartpen at the end of the quarter. To encourage students
to  submit  their  work  electronically,  for  some  assignments  the  due 
date  for  electronic  submission  was  several  hours  later  than  for 
paper  submission.  In  Year  3,  all  students  were  required  to  submit 
their  work  electronically.  However,  if  a  student  had  technical 
difficulties submitting a particular assignment, he or she could still
submit  it  on  paper  and  we  extracted  the  ink  data  at  the  end  of  the 
quarter. 

In  Years  2  and  3,  some  students  provided  self-explanation  with 
their  homework.  As  self-explanation  was  not  the  focus  of  this 
project,  and  to  maintain  consistency  across  all  students,  we  ex¬ 
cluded  the  ink  for  the  explanations  from  our  analysis.  However,  we 
did  include  the  self-explanation  ink  from  the  last  assignment  in 
Year  3,  as  all  students  provided  self-explanation  for  that  assign¬ 
ment.  In  Year  3,  some  homework  submissions  included  problems 
that  were  solved  in  part  during  a  discussion  period.  We  excluded 
these  problems  from  our  analysis  as  they  are  not  typical  homework 
problems. 

The  Livescribe  Smartpens  have  two  clocks.  One  is  used  to 
display  the  current  time  of  day,  while  the  other  is  used  to  create 
timestamps  for  the  pen  strokes.  The  former  can  be  adjusted,  while 
the  latter  cannot.  Having  a  nonadjustable  clock  for  time  stamps 
ensures  that  the  time  of  the  pen  strokes  is  correct,  even  when  there 
is  a  change  to  or  from  daylight  saving  time,  for  example.  We  used 
the  time  of  an  exam  to  determine  the  offset  between  the  timestamp 
clock  and  the  actual  time  of  day.  For  Year  3,  we  also  directly 
measured  the  offset  before  distributing  the  pens  to  the  students. 
With  this  calibration  approach,  the  offset  of  the  timestamp  clock  is 
accurate  to  within  about  5  min,  which  is  adequate  for  our  purposes. 
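The calibration idea can be sketched as follows. This is an illustrative reconstruction only: it assumes the offset is estimated by comparing a known time of day (for instance, the scheduled start of an exam) with the raw pen timestamp of a stroke written at that moment. The function names are hypothetical and do not come from the study's software.

```python
from datetime import datetime, timedelta


def estimate_offset(known_wall_time: datetime, pen_timestamp: datetime) -> timedelta:
    """Offset to add to raw pen timestamps to obtain time of day (assumed method)."""
    return known_wall_time - pen_timestamp


def to_wall_time(pen_timestamp: datetime, offset: timedelta) -> datetime:
    """Convert a raw pen timestamp to an estimated time of day."""
    return pen_timestamp + offset
```

Because the reference event is known only approximately, any offset obtained this way carries an uncertainty, which the text puts at about 5 min.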

For  all  3  years,  we  conducted  a  survey  at  the  end  of  the  course 
with  questions  about  demographics,  study  habits,  and  perceptions 
about  the  course  and  instructional  technology  used. 

Data  Mining  With  Smartpen  Technology 

We  developed  software  to  enable  us  to  manually  partition  students’ 
ink  data  into  the  individual  problems  comprising  each  assignment, 
quiz,  and  exam.  The  software  renders  the  ink  data  on  a  computer 
display,  enabling  one  to  navigate  through  the  pages  of  writing.  A 
mouse  is  used  to  select  the  ink  for  an  individual  problem  and  assign 
a  problem  number  to  it.  We  then  use  software  (Lin,  Stahovich,  & 
Herold,  2012)  to  automatically  label  each  pen  stroke  as  either  an 
equation,  free  body  diagram,  or  cross-out  stroke  (see  Figure  3). 

Once the digital ink has been partitioned into problems and labeled, we compute 13 quantitative measures to characterize a student’s homework activity, as summarized in Table 1. Our first


measure,  total  homework  time,  is  the  total  time  spent  to  complete 
all  of  the  homework  assignments.  We  define  the  time  to  complete 
one  assignment  as  the  time  from  the  first  pen  stroke  of  the 
assignment  to  the  last,  excluding  any  periods  of  inactivity  longer 
than  10  min.  Any  long  inactivity  periods  partition  the  homework 
effort  into  sessions.  Consecutive  pen  strokes  within  a  session  are 
never  more  than  10  min  apart,  while  strokes  from  different  ses¬ 
sions  are  always  at  least  10  min  apart. 
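The session definition above can be expressed as a short sketch. It assumes each pen stroke is represented by a single timestamp in seconds, which is a simplification (a real stroke has a start and an end time), and the function names are hypothetical.

```python
from typing import List

SESSION_GAP_S = 10 * 60  # inactivity threshold from the text: 10 minutes


def split_into_sessions(stroke_times: List[float],
                        gap_s: float = SESSION_GAP_S) -> List[List[float]]:
    """Group stroke timestamps (seconds) into sessions.

    A new session starts whenever the gap since the previous stroke
    is at least `gap_s`.
    """
    sessions: List[List[float]] = []
    for t in sorted(stroke_times):
        if sessions and t - sessions[-1][-1] < gap_s:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def assignment_time(stroke_times: List[float],
                    gap_s: float = SESSION_GAP_S) -> float:
    """Time from first to last stroke, excluding inactivity of `gap_s` or more."""
    return sum(s[-1] - s[0] for s in split_into_sessions(stroke_times, gap_s))
```

Summing the duration of each session is equivalent to taking the span from the first to the last stroke and subtracting every long inactivity period.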

We use three measures to characterize the timing of effort over the assignment period. Due date ink fraction, computed as the fraction of
the  pen  strokes  written  within  24  hr  of  the  due  date,  measures  the 
extent  to  which  students  wait  until  the  “last  minute”  to  complete  an 
assignment.  Similarly,  late  night  ink  fraction,  computed  as  the  fraction 
of  the  pen  strokes  written  between  midnight  and  4  a.m.,  measures  the 
fraction  of  work  done  late  at  night.  Finally,  number  of  homework 
sessions  is  simply  the  total  number  of  sessions  required  to  complete 
the  assignments,  with  a  new  session  counted  when  there  is  at  least  a 
10-min  break  from  the  previous  pen  stroke. 
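The two ink-fraction measures can be sketched as below. The sketch assumes stroke timestamps are already converted to local time of day and that out-of-window ink has already been filtered out; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def due_date_ink_fraction(stroke_times: List[datetime], due: datetime) -> float:
    """Fraction of strokes written from 24 hr before the due date onward."""
    if not stroke_times:
        return 0.0
    window_start = due - timedelta(hours=24)
    near_deadline = sum(1 for t in stroke_times if t >= window_start)
    return near_deadline / len(stroke_times)


def late_night_ink_fraction(stroke_times: List[datetime]) -> float:
    """Fraction of strokes written between midnight and 4 a.m."""
    if not stroke_times:
        return 0.0
    late = sum(1 for t in stroke_times if t.hour < 4)
    return late / len(stroke_times)
```

Both measures are fractions of stroke counts rather than of time, so a student who writes little but does so entirely on the last day still receives a due date ink fraction of 1.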

In  addition  to  considering  the  amount  of  time  spent  on  home¬ 
work,  we  also  consider  the  amount  of  writing.  As  the  name 
suggests,  total  strokes  is  the  total  number  of  pen  strokes  written  to 
complete  the  assignments.  We  also  count  the  number  of  equation 
strokes,  the  number  of  diagram  strokes,  and  the  number  of  cross- 
out  strokes.  These  measures  are  computed  using  the  auto-labeler 
from  Lin,  Stahovich,  and  Herold  (2012).  In  addition  to  stroke 
count,  we  also  consider  the  length  of  the  pen  strokes.  Total  ink 
length,  which  is  computed  in  units  of  inches,  is  the  total  distance 
the  pen  tip  travels  on  the  paper. 
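A minimal sketch of the writing measures follows. It assumes each stroke is stored as a list of (x, y) sample points in inches together with a label string; this representation and the label names are hypothetical and are not taken from the authors' software.

```python
import math
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in inches; assumed representation


def stroke_length(points: List[Point]) -> float:
    """Path length of one stroke: sum of distances between consecutive samples."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def total_ink_length(strokes: List[List[Point]]) -> float:
    """Total distance the pen tip travels on paper, over all strokes."""
    return sum(stroke_length(s) for s in strokes)


def stroke_counts(labels: List[str]) -> Counter:
    """Count strokes per label, e.g. 'equation', 'diagram', 'cross-out'."""
    return Counter(labels)
```

The total strokes measure is then simply the number of strokes, while the per-label counts give the equation, diagram, and cross-out stroke measures.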

We  use  three  measures  to  characterize  effort  on  individual  home¬ 
work  problems.  Problems  attempted  is  the  number  of  problems  for 
which  the  student  wrote  at  least  50  pen  strokes.  It  is  unlikely  that  a 
student  made  significant  progress  on  a  problem  if  he  or  she  wrote 
fewer  strokes  than  this.  For  example,  simply  writing  “Problem  1” 
takes  at  least  eight  strokes.  Average  time  per  problem  is  the  ratio  of 
total  homework  time  and  problems  attempted.  This  provides  a  means 
of  comparing  the  effort  of  students  even  if  they  did  not  complete  the 
same  number  of  problems.  Average  pen  speed  is  the  ratio  of  total  ink 
length  and  total  homework  time.  This  measure  characterizes  the  pace 
of  the  work.  Finally,  the  out  of  order  measure  describes  the  frequency 
with  which  a  student  works  nonsequentially.  Prior  work  has  found 


Table 1

Thirteen Measures Derived Through Smartpen Technology

Measure                       Description

Total homework time           Total time to complete all assignments
Due date ink fraction         Proportion of pen strokes written within 24 hr of due date
Late night ink fraction       Proportion of pen strokes written between midnight and 4 a.m.
Number of homework sessions   Number of sessions used to complete the assignments
Total strokes                 Number of pen strokes written to complete the assignments
Equation strokes              Number of equation pen strokes written to complete the assignments
Diagram strokes               Number of diagram pen strokes written to complete the assignments
Cross-out strokes             Number of cross-out pen strokes written to complete the assignments
Total ink length              Total distance (in inches) the pen travels on paper for all assignments
Problems attempted            Number of problems for which the student wrote at least 50 pen strokes
Average time per problem      Total homework time divided by problems attempted
Average pen speed             Total ink length divided by total homework time
Out of order                  Number of times a student transitions to a problem other than the next one in the assignment
that expert students often solve problems in the order assigned, while
novice  students  may  begin  one  problem  and  then  move  on  to  another 
before  completing  the  former  (Herold,  Stahovich,  Lin,  &  Calfee, 
2011).  The  out  of  order  measure  is  the  number  of  times  a  student 
transitions  to  a  problem  other  than  the  next  one.  For  example,  the 
sequence of problems 1, 3, 1, 2 has an out of order value of two. The
transitions  from  1  to  3  and  3  to  1  are  nonsequential. 
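The per-problem measures described above can be sketched as follows. The sketch assumes problems are numbered consecutively, so that "the next one" after problem k is problem k + 1, and that the input sequence may repeat a problem number for consecutive strokes; both the representation and the function names are hypothetical.

```python
from itertools import groupby
from typing import Dict, List

MIN_STROKES_ATTEMPTED = 50  # threshold stated in the text


def problems_attempted(strokes_per_problem: Dict[int, int]) -> int:
    """Number of problems with at least 50 pen strokes."""
    return sum(1 for n in strokes_per_problem.values()
               if n >= MIN_STROKES_ATTEMPTED)


def average_time_per_problem(total_time: float, attempted: int) -> float:
    """Total homework time divided by problems attempted."""
    return total_time / attempted if attempted else 0.0


def average_pen_speed(total_ink_length: float, total_time: float) -> float:
    """Total ink length divided by total homework time."""
    return total_ink_length / total_time if total_time else 0.0


def out_of_order(problem_sequence: List[int]) -> int:
    """Count transitions to a problem other than the next one."""
    # Collapse runs of the same problem so only real transitions remain.
    visited = [k for k, _ in groupby(problem_sequence)]
    return sum(1 for a, b in zip(visited, visited[1:]) if b != a + 1)
```

For the sequence 1, 3, 1, 2 from the text, the transitions 1 to 3 and 3 to 1 are counted while 1 to 2 is not, giving a value of two.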

When  computing  these  measures,  we  exclude  any  ink  that  was 
written  more  than  5  min  prior  to  the  time  the  homework  assignment 
was  posted.  This  tolerance  compensates  for  the  5-min  uncertainty  in 
our  timestamp  clock  calibration.  We  also  exclude  any  ink  written 
more  than  an  hour  after  an  assignment  due  date.  As  our  electronic 
submission  system  did  not  prevent  late  submissions,  some  students 
did  submit  their  homework  late.  We  include  any  pen  strokes  written 
during  this  past-due  hour  in  the  due  date  ink  fraction. 
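The exclusion rule can be written as a simple filter. This is a sketch under the assumption that stroke times, the posting time, and the due date are all expressed on the same calibrated clock; the names are hypothetical.

```python
from datetime import datetime, timedelta

EARLY_TOLERANCE = timedelta(minutes=5)  # allows for clock calibration uncertainty
LATE_ALLOWANCE = timedelta(hours=1)     # past-due hour that is still included


def keep_stroke(t: datetime, posted: datetime, due: datetime) -> bool:
    """True if a stroke falls inside the analysis window for an assignment."""
    return posted - EARLY_TOLERANCE <= t <= due + LATE_ALLOWANCE
```

Strokes that pass this filter but fall after the due date are the ones counted toward the due date ink fraction.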

One  of  the  questions  on  the  end-of-class  survey  asked  students  to 
report  the  amount  of  time  it  took  on  average  to  complete  a  homework 
assignment,  which  we  used  to  compute  self-reported  time  on  home¬ 
work.  For  the  first  2  years,  the  available  choices  for  answering  the 
question  were:  less  than  2  hr,  2-4  hr,  4-6  hr,  6-8  hr,  8-10  hr,  and 
more  than  10  hr.  In  Year  3,  the  choices  were  reduced  by  one  so  that 
the  last  choice  was  “more  than  8  hr.”  When  computing  the  total  time 
spent  on  homework,  we  consider  a  student’s  average  assignment  time 
to  be  the  midpoint  of  the  selected  interval.  However,  if  the  student 
selected  the  largest  choice,  we  use  the  lower  bound  (i.e.,  either  8  or  10 
hr).  For  example,  if  a  student  in  Year  1  reported  “2-4  hr,”  we  would 
compute the total self-reported time over the seven homework assignments
to be 21 hr. Similarly, if they reported “more than 10 hr” we
would  compute  the  value  to  be  70  hr. 
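
The conversion from survey choice to total self-reported time can be sketched as a lookup followed by a multiplication. The intervals and the count of seven assignments come from the text; the dictionary and function names are hypothetical, the lower bound of 0 hr for the "less than 2 hr" choice is an assumption, and the Year 3 list is inferred from the statement that the choices were reduced by one.

```python
N_ASSIGNMENTS = 7  # seven homework assignments (from the text)

# (lower, upper) bounds in hours; upper is None for the open-ended top choice.
# Assumption: "less than 2 hr" is treated as the interval 0-2 hr.
CHOICES_YEARS_1_2 = {
    "less than 2 hr": (0, 2),
    "2-4 hr": (2, 4),
    "4-6 hr": (4, 6),
    "6-8 hr": (6, 8),
    "8-10 hr": (8, 10),
    "more than 10 hr": (10, None),
}
CHOICES_YEAR_3 = {
    "less than 2 hr": (0, 2),
    "2-4 hr": (2, 4),
    "4-6 hr": (4, 6),
    "6-8 hr": (6, 8),
    "more than 8 hr": (8, None),
}


def hours_per_assignment(choice, choices):
    """Midpoint of the selected interval, or the lower bound for the top choice."""
    lower, upper = choices[choice]
    if upper is None:
        return float(lower)
    return (lower + upper) / 2


def total_self_reported_hours(choice, choices):
    """Total self-reported time over all assignments."""
    return N_ASSIGNMENTS * hours_per_assignment(choice, choices)


print(total_self_reported_hours("2-4 hr", CHOICES_YEARS_1_2))           # 21.0
print(total_self_reported_hours("more than 10 hr", CHOICES_YEARS_1_2))  # 70.0
```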

Results 

Data  Set 

Our  dataset  includes  data  on  13  measures  from  a  total  of  328 
students:  92  from  Year  1,  109  from  Year  2,  and  127  from  Year  3. 
All  of  these  students  completed  the  course  and  received  a  final 
course  grade.  We  excluded  data  from  one  student  in  Year  2  and 
four  from  Year  3  because  their  digital  ink  data  was  corrupted. 


As  described  in  the  Method  section,  some  students  in  Years  2 
and  3  were  asked  to  write  self-explanations  and  some  others  used 
an  intelligent  tutoring  system.  We  wanted  to  determine  whether  the 
same  pattern  of  results  could  be  obtained  in  different  contexts.  We 
performed  one-way  analysis  of  variance  (ANOVA)  to  determine  if 
these  treatments  led  to  any  significant  differences  in  final  grades 
between  the  experimental  and  control  groups.  In  both  cases,  the 
differences were not significant (p = .706 for Year 2 and p = .957
for  Year  3)  and  thus,  in  our  analysis,  we  ignore  these  distinctions 
between  students. 

Table 2 shows the correlation between each of the 13 measures
and  course  grade  for  all  students,  and  for  each  cohort  separately, 
with  significant  correlations  at  p  <  .05  denoted  with  an  asterisk. 
We  focus  on  the  results  for  all  students,  and  view  the  cohort  data 
as  a  form  of  replication.  Table  3  shows  the  means  and  standard 
deviations of each of the 13 measures for all students, and for each
cohort  separately. 
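
For readers who want to reproduce entries like those in Table 2 from their own data, the following sketch computes a correlation coefficient and its t statistic. It assumes the reported correlations are Pearson coefficients, which this section does not state explicitly; the function names are hypothetical and no data from the study are used.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

The t statistic is compared against a t distribution with n - 2 degrees of freedom to obtain the p value; with samples of around 100 or more students, values beyond roughly 2 in absolute size correspond to p < .05.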

Some  of  our  measures  are  sensitive  to  the  number  of  problems 
assigned.  As  the  number  of  homework  problems  varied  between 
the  three  cohorts,  we  performed  another  analysis  in  which  we 
normalized  the  features  by  the  number  of  problems  assigned  to  the 
cohort. Four features (due date ink fraction, late night ink fraction,
average time per problem, and average pen speed) did not
require  normalizing  as  they  are  insensitive  to  the  number  of 
problems.  Normalizing  the  measures  produced  only  a  negligible 
change  in  the  correlation  with  course  grade.  The  correlations 
changed  by  less  than  .01  (and  p  by  less  than  .003)  for  all  measures. 
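
One reason the change is small can be shown directly, assuming Pearson correlations and a single constant c per cohort (the number of problems assigned). Dividing a measure X by a positive constant rescales its covariance with grade Y and its standard deviation by the same factor:

```latex
r\!\left(\tfrac{X}{c},\,Y\right)
  = \frac{\operatorname{Cov}(X/c,\,Y)}{\sigma_{X/c}\,\sigma_Y}
  = \frac{\tfrac{1}{c}\operatorname{Cov}(X,\,Y)}{\tfrac{1}{c}\,\sigma_X\,\sigma_Y}
  = r(X,\,Y), \qquad c > 0 .
```

Under these assumptions, normalization leaves each within-cohort correlation unchanged and can affect only the pooled correlation for all students, where c differs across cohorts.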

We also investigated whether gender is related to course
performance. For Cohort 1, the average score for male students was
.71 (n = 78), while the average score for female students was .65
(n = 12). However, this difference in means was nonsignificant,
with p = .212. Similarly, for Cohort 2, the average score for male
students was .66 (n = 87), while the average score for female
students was .63 (n = 16). This difference in means was again
nonsignificant, with p = .530. For Cohort 3, the average score for
male students was .68 (n = 105), while the average score for
female students was .61 (n = 19). This difference between means was
significant, with p = .028.


Table  2 

Correlation  Between  Course  Grade  and  Each  of  13  Smartpen  Measures  for  all  Students  and 
Each  Cohort  Separately 


Measure                       All students  Cohort 1  Cohort 2  Cohort 3
Total homework time            .44*          .42*      .59*      .31*
Due date ink fraction         -.32*         -.38*     -.48*     -.20*
Late night ink fraction       -.06          -.08      -.15      -.04
Number of homework sessions    .33*          .05       .58*      .27*
Total strokes                  .49*          .55*      .60*      .40*
Equation strokes               .49*          .54*      .61*      .40*
Diagram strokes                .41*          .46*      .51*      .34*
Cross-out strokes              .32*          .33*      .34*      .33*
Total ink length               .42*          .44*      .50*      .39*
Problems attempted             .45*          .35*      .68*      .27*
Average time per problem       .33*          .32*      .39*      .29*
Average pen speed             -.02           .02      -.11       .07
Out of order                   .10          -.17       .27*      .09

* p < .05.

Table 3

Means and Standard Deviations for Each of 13 Smartpen Measures for all Students and Each Cohort Separately

                                    All students        Cohort 1            Cohort 2            Cohort 3
Measure                             μ         σ         μ         σ         μ         σ         μ         σ
Total homework time (hr)            17.1      8.5       17.7      6.4       16.9      9.1       17.0      9.2
Due date ink fraction               .7        .3        .7        .2        .7        .2        .6        .3
Late night ink fraction             .1        .1        .2        .2        .1        .1        .1        .1
Number of homework sessions         36.0      19.6      38.8      16.7      32.0      18.1      37.3      22.3
Total strokes                       20189.5   9186.6    18944.0   6880.2    21162.0   10002.3   20257.0   9855.3
Equation strokes                    14858.8   6939.1    13975.2   5411.3    15411.0   7330.4    15025.1   7542.9
Diagram strokes                     4925.9    2373.6    4572.3    1708.0    5318.8    2767.8    4844.9    2390.9
Cross-out strokes                   404.7     278.7     396.5     207.6     432.2     359.9     387.0     241.8
Total ink length (inches)           5936.2    3030.1    5568.8    2308.0    5979.0    3252.0    6165.7    3280.8
Problems attempted                  34.4      8.9       35.7      5.9       35.5      10.2      32.4      9.2
Average time per problem (min)      29.1      11.3      29.5      9.8       27.6      10.8      30.0      12.6
Average pen speed (inches/second)   .104      .045      .093      .037      .105      .038      .112      .053
Out of order                        20.2      15.3      17.7      11.1      22.6      19.6      20.1      13.5

Note. μ = mean; σ = standard deviation.

We  also  examined  the  correlation  between  measures  of  prior 
knowledge  and  course  performance.  Here  we  use  two  measures  to 
quantify  prior  knowledge:  the  student’s  SAT  score  (based  on 
combined  verbal,  quantitative,  and  writing  scores)  and  their  high 
school GPA. For Cohort 1 (r = .534, p < .001) and Cohort 3 (r =
.284,  p  =  .003),  there  was  a  significant  correlation  between  SAT 
score  and  final  course  grade,  but  not  for  Cohort  2  (r  =  .091,  p  = 
.378).  The  correlation  between  high  school  GPA  and  final  course 
grade  was  significant  for  Cohort  1  (r  =  .317,  p  =  .003)  and  Cohort 
2  (r  =  .285,  p  =  .004)  but  not  for  Cohort  3  (r  =  .184,  p  =  .052). 

We  performed  a  stepwise  linear  regression  to  examine  the 
predictive  ability  of  our  entire  set  of  measures.  In  computing  a 
stepwise  model  we  required  the  probability  of  F  <  .05  to  enter  a 
measure, and the probability of F ≥ .10 to remove a measure. We
initialized the model by including three measures: total strokes,
total homework time, and problems attempted. For all students,
total  strokes,  problems  attempted,  out  of  order  and  due  date  ink 
fraction  were  selected  with  r  =  .57,  p  <  .001.  For  Cohort  1,  total 
strokes  and  out  of  order  were  selected  with  r  =  .67,  p  <  .001.  For 
Cohort  2,  total  strokes,  problems  attempted,  and  due  date  ink 
fraction  were  selected  with  r  =  .72,  p  <  .001.  For  Cohort  3,  only 
total  strokes  was  selected  with  r  =  .40,  p  <  .001.  Thus,  in  the 
analysis  with  the  best  statistical  power  (i.e.,  the  combined  data 
from  all  students),  there  is  evidence  that  each  of  four  smartpen 
measures  (i.e.,  total  strokes,  problems  attempted,  out  of  order,  and 
due  date  ink  fraction)  makes  a  unique  contribution  to  predicting 
course  grade. 

As  a  follow-up  we  conducted  another  stepwise  linear  regression 
identical  to  the  one  described  previously,  but  with  SAT  score 
entered  as  the  first  variable  and  the  smartpen  variables  entered  in 
order  of  their  correlation.  For  all  students,  SAT  score,  text  strokes, 
problems  attempted,  out  of  order  and  due  date  ink  fraction  were 
selected with r = .63, p < .001. For Cohort 1, SAT score, total
strokes, out of order and late night ink fraction were selected with
r = .76, p < .001. For Cohort 2, SAT score, problems attempted,
and due date ink fraction were selected with r = .71, p < .001. For
Cohort  3,  SAT  score  and  text  strokes  were  selected  with  r  =  .63, 
p  <  .001.  Thus,  in  the  analysis  with  the  best  statistical  power  (i.e., 
the  combined  data  from  all  students),  there  is  evidence  that  each  of 
four smartpen measures (i.e., text strokes, problems attempted, out
of order, and due date ink fraction) contributes uniquely to predicting
course grade, even when the effects of prior knowledge are
controlled  (i.e.,  smartpen  variables  predict  course  grade  beyond  the 
effects  of  SAT  score).  Overall,  although  construction  of  a  factor- 
analyzed  measurement  instrument  based  on  smartpen  variables  is 
beyond  the  scope  of  this  study,  there  are  indications  that  course 
grade  is  uniquely  predicted  by  a  collection  of  smartpen  measures. 

The  final  course  grade  includes  homework  score  with  a  weight 
of  10%.  To  examine  if  this  artificially  increased  the  correlations 
between  our  measures  of  homework  activity  and  course  grade,  we 
recomputed the course grade excluding homework score and recomputed
the correlations. This resulted in only a negligible
change  in  the  correlations,  and  no  change  in  the  factors  chosen  in 
the  regression  analyses.  More  specifically,  for  Cohorts  1  and  2,  the 
changes  in  correlations  were  no  greater  than  .04.  For  Cohort  3,  the 
changes  were  no  greater  than  .07. 
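
The recomputation can be sketched under a simple assumption: the final grade is a weighted sum in which homework carries a weight of 10% and all other components together carry the remaining 90%, with every score on a common 0 to 1 scale. The grading formula itself is not given in this section, so the function below is illustrative only.

```python
HOMEWORK_WEIGHT = 0.10  # weight of homework in the final grade (from the text)


def grade_without_homework(final_grade, homework_score, weight=HOMEWORK_WEIGHT):
    """Remove the homework component and rescale the remainder to a 0-1 range.

    Assumption: final_grade = weight * homework_score + (1 - weight) * other,
    with all scores expressed on a 0-1 scale.
    """
    return (final_grade - weight * homework_score) / (1.0 - weight)


# Hypothetical student: final grade .75 with a homework score of .90.
adjusted = grade_without_homework(0.75, 0.90)
```

Correlating each smartpen measure with the adjusted grade instead of the final grade gives the comparison described above.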

How  Much:  Is  Homework  Time  Related  to 
Course  Grade? 

According  to  the  basic  version  of  the  time-on-task  hypothesis, 
students  who  spend  more  time  working  on  their  homework  should 
get  better  grades  in  the  course.  The  first  line  of  Table  2  shows  the 
correlation  between  total  time  spent  on  the  homework  problems 
and  course  grade  for  all  students  combined,  and  for  each  cohort 
separately.  As  the  table  illustrates,  there  is  a  significant  correlation 
for  each  cohort  and  for  all  students  combined,  consistent  with 
predictions.  Overall,  there  is  strong  and  consistent  support  for  the 
time-on-task  hypothesis,  based  on  data  collected  through  smartpen 
technology. 

What happens when we look at students’ self-reported time on
homework per week as reported on a postquestionnaire? In contrast
to the significant correlation between course grade and the
actual time on homework recorded through smartpens, the correlation
between course grade and self-reported time on homework
is not significantly positive for all students combined (r = -.16)
or for any of the three cohorts (r = -.29 for Cohort 1, r = -.14
for Cohort 2, and r = -.13 for Cohort 3). Instead, the correlation
is negative for all three cohorts, and the negative correlation for
Cohort 1 is statistically significant.

Furthermore, is the students’ self-reported time consistent with
the actual measured time to complete homework assignments? For
two of the three cohorts, there was only a weak correlation between
self-reported time and total homework time: For Cohort 1,
r = .21, p = .052; for Cohort 2, r = .16, p = .139; and for Cohort
3, r = .35, p < .001. Additionally, nearly all students overreported
their homework time. For Cohort 1, 88.5% of students overreported
homework time with an average overestimation of 19.0 hr.
For Cohort 2, 85.5% of students overreported homework time with
an average overestimation of 23.5 hr. For Cohort 3, 85.5% of
students overreported homework time with an average overestimation
of 13.4 hr.
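
The overreporting summary can be sketched as a comparison of paired totals. The text does not say whether the average overestimation is taken over all students or only over those who overreported; the sketch below assumes the latter. The function name and the sample values are hypothetical.

```python
def overreport_summary(self_reported_hr, measured_hr):
    """Return (percent of students overreporting, mean overestimation in hours).

    Assumption: the mean overestimation is averaged over overreporters only.
    """
    if len(self_reported_hr) != len(measured_hr) or not self_reported_hr:
        raise ValueError("need two non-empty samples of equal length")
    gaps = [s - m for s, m in zip(self_reported_hr, measured_hr)]
    over = [g for g in gaps if g > 0]
    percent_over = 100.0 * len(over) / len(gaps)
    mean_over = sum(over) / len(over) if over else 0.0
    return percent_over, mean_over


# Made-up totals (hours) for four students, not data from the study.
self_reported = [21.0, 35.0, 7.0, 49.0]
measured = [17.5, 12.0, 9.0, 30.0]
percent_over, mean_over = overreport_summary(self_reported, measured)
```

In this made-up example, three of the four students overreport, so the first value is 75.0.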

This  pattern  of  differences  between  actual  time  and  self-reported 
time  points  to  the  value  of  technology-supported  measures  of 
homework  activity  in  testing  the  time-on-task  hypothesis.  This  set 
of  contrasting  findings  constitutes  a  major  contribution  of  this 
study. 

When:  Is  the  Timeliness  of  Homework  Activity 
Related  to  Course  Grade? 

According  to  the  updated  version  of  the  time-on-task  hypothesis, 
which  considers  the  quality  of  the  time  spent  on  homework,  students 
who  commonly  wait  until  the  last  minute  to  do  homework  (i.e.,  within 
24  hr  of  the  due  date)  or  who  commonly  do  homework  late  at  night 
(i.e.,  midnight  to  4  a.m.)  should  get  worse  grades  in  the  course. 
Consistent  with  this  prediction,  the  second  line  of  Table  2  shows  a 
significant  negative  correlation  between  due  date  ink  fraction  and 
course  grade  for  all  students,  and  for  each  cohort.  In  contrast,  the  third 
line of Table 2 does not show a significant correlation between late
night ink fraction and course grade for any of the cohorts, suggesting
perhaps  that  working  late  at  night  is  not  necessarily  an  indication  of 
lower  quality  time.  Overall,  a  major  empirical  contribution  is  strong 
and  consistent  evidence  that  the  quality  of  how  homework  time  is 
spent  (as  measured  by  the  proportion  of  homework  time  done  within 
24  hr  of  the  deadline)  is  related  to  course  grade.  The  smartpen 
technology  allows  us  to  address  this  prediction  of  an  updated  version 
of  the  time-on-task  hypothesis. 

Similarly, breaking an assignment up into multiple sessions may
be a way to enable distributed practice (spreading practice over
multiple sessions), which has been shown to improve learning
(Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013). Accordingly,
time-on-task should be most efficient when it is spread
over  multiple  sessions.  Consistent  with  this  prediction,  the  fourth 
line  of  Table  2  shows  that  number  of  homework  sessions  correlates 
significantly  with  course  grade  for  all  students  combined  and  for 
two  of  the  three  cohorts.  Again,  smartpen  technology  allows  us  to 
address  a  prediction  about  the  time-course  of  homework  activity 
using  data  that  is  not  otherwise  available. 

How  Many:  Is  the  Amount  of  Writing  Activity  Related 
to  Course  Grade? 

According to the updated version of the time-on-task hypothesis,
which considers the amount of productive activity, students
who create more pen strokes while working on homework assignments
should get better grades in the course. Consistent with this
prediction, the fifth line of Table 2 shows a significant correlation
between total strokes and course grade for all students, and for
each cohort. Also consistent with predictions, the next lines in
Table  2  show  the  same  pattern  of  significant  correlations  (for  all 
cohorts)  between  grades  and  equation  strokes,  diagram  strokes, 
cross-out  strokes,  and  total  ink  length,  respectively.  Overall,  there 
is  consistent  evidence  that  higher  achievement  is  related  to  the 
level  of  effort  exerted  by  students  as  indicated  by  their  pen  strokes. 

Similarly,  Table  2  shows  that  for  all  students  and  for  each  cohort 
separately,  there  is  a  significant  correlation  between  the  number  of 
problems attempted and course grade (line 10) and between average
time per problem and course grade (line 11).

What  is  not  related?  Two  variables  did  not  correlate  consistently 
with course grade (average pen speed and out of order), perhaps
because they are not appropriate measures of the amount of productive
activity. Writing faster or slower does not necessarily
indicate  more  or  less  effort,  and  trying  problems  out  of  order  can 
be  attributed  to  several  causes  other  than  effort,  including  lack  of 
concentration. 

Discussion 

Empirical  Implications 

Concerning issues about time-on-task, actual time spent working
on homework problems was positively correlated with course
grades, but self-reported time spent on homework problems was
not. Concerning the time course of homework activity, the amount
of homework time spent within 24 hr of the deadline was negatively
correlated with course grades, the amount of homework time
spent between midnight and 4 a.m. was not correlated with course
grades, and breaking homework time into more sessions was
positively correlated with course grades. Concerning actual behavior
and effort on homework problems, course grades were positively
correlated with the total number of pen strokes, equation
strokes,  diagram  strokes,  cross-out  strokes,  total  ink  produced, 
total  problems  attempted,  and  time  per  problem.  Course  grades 
were  not  consistently  correlated  with  average  pen  speed  or  solving 
problems  out  of  order. 

Theoretical  Implications 

This study investigates a crucial link in a model of academic
learning: the link between engagement or effort to learn, as measured
by the amount and time students allocate to a learning task,
and performance, as measured by learning outcome in a college
course. In particular, the present study examines the idea that the
amount of time that students spend in productive learning is related
to academic achievement in a course. Although no causal conclusions
can be drawn, the work draws attention to a potential causal
mechanism leading to learning, namely, amount of productive
learning activity. Importantly, both the quantity of time spent on
homework and the quality of how homework time is used are
related to achievement. Higher quality use of time is reflected in
doing homework long before it is due and breaking assignments
into smaller sessions. Effortful activity on homework is reflected
in the number of pen strokes, the total ink produced, and the
number of problems attempted. A major contribution of this project
is to enable more detailed measures of student effort or engagement,
which is proposed to be the mechanism underlying
academic learning.

Practical Implications

This  is  a  correlational  study  that  examines  actual  performance  in 
a  real  college  course,  so  no  causal  conclusions  can  be  drawn. 
However,  this  study  offers  preliminary  evidence  for  the  potential 
of  homework  as  an  aid  to  student  achievement,  particularly  when 
students  work  on  their  homework  in  a  timely  and  effortful  way. 
The  role  of  productive  time  on  task  has  long  been  recognized  as  a 
critical  issue  in  intelligent  tutoring  systems  (Anderson,  Corbett, 
Koedinger,  &  Pelletier,  1995). 

Methodological  Implications 

This  study  highlights  the  potential  of  educational  data  mining 
techniques in general (i.e., techniques for measuring and summarizing
learner activity during learning), and smartpen technology in
particular,  for  educational  research.  Smartpen  technology  allows 
for  assessing  student  study  activity  at  a  level  of  detail  that  was  not 
previously possible, and thereby offers a new and powerful methodology
for testing implications of educational theories.

Our  results  confirm  what  other  researchers  have  proposed 
(Blumner  &  Richards,  1997;  Schuman  et  al.,  1985):  Students’ 
self-reports  of  study  effort  are  often  unreliable.  Finally,  this  study 
points  to  the  useful  role  of  replication  in  educational  research  (as 
noted  by  Shavelson  &  Towne,  2002)  by  showing  the  same  pattern 
of  results  across  three  independent  cohorts  of  students. 

Limitations  and  Future  Directions 

This  work  is  a  first  step  at  building  techniques  that  can  provide 
automated assessment of performance from an analysis of handwritten
homework. Our present analysis examines the relationship
between  the  amount  of  effort  on  homework  and  performance.  In 
future  work,  we  plan  to  examine  how  patterns  of  homework 
activity  contribute  to  success.  For  example,  our  current  analysis 
suggests  that  doing  homework  just  before  it  is  due  may  not  be  a 
successful  strategy,  but  causal  claims  cannot  be  drawn  based  on 
the correlational relationship identified in this study. Thus experimental
work is needed to test causal claims. Experimental research
should be designed to explicitly test hypotheses suggested
by  this  study  concerning  the  possible  causal  role  of  productive 
time  on  task  by  directly  manipulating  this  factor  and  examining  the 
effects  on  learning  outcome.  Within  the  context  of  experimental 
research, future work is also needed to examine potential moderating
variables such as the learner’s prior knowledge and metacognitive
skills.

The  correlations  involving  self-reported  homework  time  and 
actual  homework  time  should  be  interpreted  in  light  of  the  fact  that 
our  self-report  measure  of  homework  time  involved  asking  the 
learner  to  check  a  category  that  was  converted  to  a  number  for 
analyses  (such  as  “two  to  four  hours  per  week”  being  recorded  as 
3,  “four  to  six  hours  per  week”  being  recorded  as  5,  etc.),  whereas 
our  smartpen  measure  of  homework  time  is  based  on  a  continuous 
scale. 

Another  concern  is  whether  the  act  of  being  asked  to  use 
smartpens,  and  the  ensuing  awareness  of  being  observed,  could 
cause  students  to  be  more  careful  about  how  they  deal  with 
homework  problems  than  they  would  otherwise,  could  make  them 
want to use scratch paper before solving homework problems with
a smartpen, and could create discomfort or distraction that affects
homework  behavior.  In  short,  it  is  important  to  ensure  that  students 
do  their  homework  in  their  usual  way  but  with  use  of  their 
smartpens  and  nothing  else.  In  the  present  study,  students  were 
instructed  to  show  all  their  work  using  their  smartpen,  and  a 
postexperimental  questionnaire  indicated  reasonable  compliance. 
On a survey from Cohort 3 asking students to rate smartpen use on
a scale from 1 (“doing all homework elsewhere”) to 7 (“using the
pen to do everything”), the mean rating was 5.1 (SD = 1.7). Future
work  should  involve  more  evidence  concerning  fidelity,  such  as 
poststudy  interviews.  Similarly,  the  total  time  measure  was  not 
based  on  any  activity  before  the  first  pen  stroke  so  it  would  not 
include  time  to  initially  read  and  think  about  the  problem  before 
starting  to  answer.  Another  thorny  issue  concerns  whether  course 
grade  is  an  adequate  measure  of  learning  outcome.  In  the  present 
study,  course  grade  was  based  on  tests  that  involved  concepts 
related  to  the  homework  problems,  but  no  detailed  method  of 
alignment  was  implemented. 

The  present  work  is  based  on  the  idea  that  a  deeper  analysis  of 
the  sequencing  of  homework  activities  can  provide  additional 
insights about successful and unsuccessful study strategies. Identifying
such strategies will lead to experimental studies that ultimately
enable automated coaching systems to examine students’
study  habits  and  recommend  interventions  aimed  at  increasing 
academic  success.  Additionally,  we  plan  to  extend  the  smartpen 
technology  to  the  study  of  note-taking  during  classroom  lectures  in 
order  to  identify  classroom  learning  strategies  that  are  related  to 
course  grade. 

This work is a step in applying educational data mining techniques
to learning activities in traditional, rather than online,
environments.  Our  current  studies  have  focused  on  one  course 
(i.e.,  statics),  and  more  work  is  needed  to  determine  how  our 
techniques  will  generalize  to  other  domains  for  which  homework 
assignments  comprise  handwritten  problem  solving.  We  anticipate 
that  our  techniques  will  be  applicable  to  assessing  homework 
habits in a variety of math, science, and engineering subjects.

This research investigated the effects of 3 instructional design formats on learning introductory accounting.
In accordance with cognitive load theory, it was predicted that students who learned with a
guided self-managed instructional design format would outperform students who learned with a
conventional split-attention format or an integrated format on a recall test and a transfer test. In the guided
self-management condition students were instructed to reorganize text and diagrams to reduce the need
to search for the solution steps within the text and match them with corresponding parts of the diagram,
thereby freeing cognitive resources for learning. The results of an experiment conducted with 123
undergraduate university students confirmed the hypothesis by consistently demonstrating that students
in the guided self-managed condition outperformed students in the integrated and split-attention conditions
on the recall and transfer tests.

Keywords:  accounting,  cognitive  load  theory,  learning,  guided  self-management,  split-attention 


Cognitive load theory (CLT; Ayres & Sweller, 2005; Paas,
Renkl, & Sweller, 2003; Sweller, 2015; Sweller, Ayres, & Kalyuga,
2011) uses knowledge of human cognition to provide instructional
design principles that support the efficient use of working
memory. Over the last three decades CLT research has almost
exclusively focused on instructor-managed cognitive load, that is, on how
instructional designers can best design learning materials following
CLT principles (Paas, Van Gog, & Sweller, 2010). The basic
idea is that when the instructional principles are used by instructional
designers they lead to decreased working memory load
caused by task aspects and activities that are unproductive for
learning, thereby freeing up working memory resources for activities
that are productive for learning.

The study reported in this article takes a different approach and
follows up on recent research focusing on equipping learners with
strategies to self-manage cognitive load when dealing with instructional
materials with evident split attention (see also Agostinho,
Tindall-Ford, & Roodenrys, 2013; Roodenrys, Agostinho, Roodenrys,
& Chandler, 2012). Despite the empirically convincing
superiority of integrated formats over split-source formats (Ginns,
2006), recent studies by Agostinho and colleagues (2013) and
Roodenrys and colleagues (2012) have suggested that learners can
also be trained to manage their own cognitive load when confronted
with split-attention materials.

The split-attention effect is one of the major design principles
of  CLT,  which  indicates  that  replacing  multiple  sources  of  visual 
information  with  a  single,  integrated  source  of  information  leads  to 
better  learning  (Ayres  &  Sweller,  2005;  Ginns,  2006).  Separated 
sources  of  information  such  as  text  and  diagram  require  the  learner 
to  hold  small  segments  of  text  in  memory  while  searching  for  the 
matching diagram. Such a process continues until all the information
is rendered intelligible (Agostinho et al., 2013).

Models  and  Learning  From  Multiple  Representations 

Learning  environments  frequently  combine  several  forms  of 
representations  like  animations,  pictures,  texts,  tables,  or  formulas. 
However,  learners’  acquisition  of  knowledge  from  these  multiple 
sources  requires  learners  to  create  referential  connections  between 
corresponding elements and corresponding structures in the different
representations to construct a coherent mental representation
(e.g., Schwartz & Martin, 2004; Seufert, 2003). Many studies
investigating learning from multiple representations use multimodal
theories of human memory to formulate hypotheses and
explain  results  (see,  e.g.,  Liu,  Lin,  Tsai,  &  Paas,  2012).  Two 
influential  models  that  have  inspired  many  other  theories,  such  as 
cognitive load theory (Plass, Moreno, & Brünken, 2010; Sweller,
2015)  and  the  cognitive  theory  of  multimedia  learning  (Mayer, 
2009),  are  Baddeley’s  working  memory  model  (Baddeley,  1992) 
and  Paivio’s  dual-coding  model  (Clark  &  Paivio,  1991;  Paivio, 
1986).  Baddeley’s  model  divides  working  memory  into  a  “visual- 
spatial  scratch  pad”  for  dealing  with  visually  based  information 
and  a  “phonological  loop”  to  deal  with  auditory,  primarily  speech- 
based,  information.  These  two  systems,  in  turn,  are  governed  by  a 
central executive. In Paivio’s model, pictorial information and
verbal information are processed in different cognitive subsystems:
an imagery system and a verbal system. Pictures are processed
and encoded both in the imagery system and in the verbal system,
while words and sentences are generally processed and encoded
only in the verbal system. The memory-enhancing effect of pictures
in texts is attributed to the benefit of dual coding as
compared with single coding in memory.

Despite the benefits associated with learning from multiple
representations, such as elaboration, flexibility, and multiple perspectives,
there is empirical evidence suggesting that learners,
particularly those with little prior knowledge, have difficulty
integrating multiple sources of information and typically fail to
construct a coherent mental representation (Mayer, 1997; Rau,
2015; Stern, Aprea, & Ebner, 2003). In this study we use cognitive
load  theory  to  explain  this  failure  and  to  provide  a  potential 
solution  in  terms  of  guided  self-management  of  cognitive  load. 

CLT  and  Its  Instructional  Formats 

CLT is based on the assumption that human cognitive architecture
consists of a working memory with very limited capacity
when dealing with new information (Sweller, 2015) and an unlimited
long-term memory in which elements are organized and stored
in the form of domain-specific knowledge structures known as
schemas (van Merriënboer & Ayres, 2005). Although working
memory  can  only  hold  a  limited  number  of  elements  at  any  given 
time,  working  memory  can  access  complex  schemas  consisting  of 
huge  arrays  of  interrelated  elements  allowing  a  learner  access  to 
previously  learned  material  stored  in  the  long-term  memory 
(Sweller,  2015).  Access  of  schemas  in  long-term  memory  can 
reduce  working  memory  load,  thereby  freeing  working  memory 
resources  for  learning.  However,  learning  content  that  is  novel  to 
students  does  not  have  associated  schemas  in  long-term  memory. 

CLT suggests that effective instructional design should take into
account human cognitive architecture, in particular concentrating
on effective use of limited working memory resources (Choi, Van
Merriënboer, & Paas, 2014; Sweller, 2015). One of the loads
identified by CLT is extraneous cognitive load, which is the
burden imposed on working memory by the manner in which the
information is presented or the activities in which the learner must
engage (Sweller et al., 2011). This load can result from poorly
designed instructional material. CLT research has mainly focused
on instructor-manipulated instructional materials and on providing
instructor-designed learning materials that take into account the
cognitive load imposed on the learner.

When learning new content, instructional formats with
separate text and diagrams hinder learning because they require the
learner to search for relevant text and match it with particular sections
of the diagram. Such a presentation unnecessarily loads the
limited capacity of working memory with processing that is not
relevant to learning (Leahy, Chandler, & Sweller, 2003). Hence, many
studies have illustrated the importance of instructional material
designed with CLT principles in mind. Five of the most researched
CLT-derived instructional effects are the (a) expertise reversal
effect (e.g., Blayney, Kalyuga, & Sweller, 2010; Kalyuga, Ayres,
Chandler, & Sweller, 2003), (b) worked example effect (e.g., Paas
& Van Gog, 2006; Sweller, 2006), (c) split-attention effect (e.g.,
Ayres & Sweller, 2005; Clark, Nguyen, & Sweller, 2006), (d)
modality effect (e.g., Ginns, 2005; Goolkasian, Foos, & Eaton,
2009), and (e) redundancy effect (e.g., Samur, 2012). For an
exhaustive overview of CLT-based instructional formats and their
empirical base, see Sweller et al. (2011) and van Merriënboer and
Sweller (2005). These design principles have been verified in
numerous experiments conducted with a diverse range of instructional
materials. Within the cognitive load theory framework, one
main characteristic that has been identified as contributing to
negative effects on learning is the need for learners to split attention
between multiple sources of information that must be integrated
before they can be understood (Ayres & Sweller, 2005; Clark et
al., 2006).

To summarize, empirical research has provided valuable insights
into different facets of learning, for example, demonstrating
that the process of coherence formation is cognitively demanding
and learners with insufficient prior knowledge are often unable to
cope with this task. Consequently, they do not use different
representations but rather concentrate only on one representation and
therefore fail to integrate and reach the goals of elaboration,
abstraction, flexibility, and coherence (Seufert, 2003; Seufert &
Brünken, 2006).

Previous research has focused on how educators can use CLT-designed
instructional material to manage students’ cognitive load
and improve their learning performance, but the current research
investigates student application of CLT design principles to manage
their own cognitive load and improve learning. Research on
learning from split-attention instructional materials has not extensively
provided techniques that would empower students to successfully
and systematically deal with split-attention materials on
their own. This study investigates whether it is possible to instruct
novice students on how to self-manage split attention. With an
exclusive focus on novice students using accounting instructional
material, two cognitive load theory effects that may result from
manipulating those materials, that is, the split-attention effect and
the guided self-management effect, were explored. The next sections
discuss the split-attention effect in the context of accounting
instructional material, followed by a discussion of guided self-management
instructional designs. Thereafter, an experiment is
presented for the evaluation of the guided self-management strategy.

The  Split-Attention  Effect 

Separate text and diagrams are very difficult to understand and
consequently have a negative effect on learning (Ginns, 2006).
This form of presentation, referred to as split-attention, demands
effort from the learner to mentally reorganize the text or
explanations related to the diagram and/or symbols. Split-attention
occurs when texts accompanying a diagram are presented separately
and are unintelligible in isolation. To understand the material,
the learner must hold small pieces of text in working memory
while searching for the matching diagrammatic representation.
This process continues until all the information is rendered
intelligible.

Over the last two decades, researchers have been developing
alternatives to instructional formats that require learners to engage in
extensive search and match processes that increase working memory load
(Ayres & Sweller, 2005; Florax & Ploetzner, 2010; Morrison,
Dorn, & Guzdial, 2014). One successful strategy to reduce the working
memory load imposed by search and match activities that are
not relevant to learning is physical integration of the different
information sources (Ayres & Sweller, 2005; Paas et al., 2003;
Sweller, 2015; Sweller et al., 2011). In a diagram and text presentation,
the separation of the text from the diagram forces the
learner to look back and forth between the relevant parts of the
diagram and the text, a process that unnecessarily imposes working
memory load. If the diagram is combined with text, the learner
can concentrate better on learning the content from the combined
presentation (Sweller et al., 2011). Research has shown that integrated
formats are superior to split-attention formats even though
learners are required to mentally integrate under both formats
(Agostinho et al., 2013; Paas et al., 2003; Sweller, 2015).

A  meta-analysis  of  the  split-attention  effect  has  shown  that 
integrated  instructional  formats  reduce  extraneous  cognitive  load 
(Ginns,  2006).  Replacing  multiple  sources  of  information  with  a 
single,  integrated  source  of  information  assists  with  more  effective 
learning  (Ginns,  2006;  Mayer,  2009).  The  split-attention  effect 
(e.g.,  Rose  &  Wolfe,  2000)  can  arise  because  information  is 
spatially  (spatial  contiguity  effect;  e.g.,  Clark  &  Mayer,  2008)  and 
temporally  separated  (temporal  contiguity  effect;  e.g.,  Ginns, 
2006).  These  studies  have  shown  that  students  often  learn  more 
when  complex  educational  content  is  designed  to  reduce  the  space 
(spatial  contiguity  effect)  or  time  (temporal  contiguity  effect) 
between  disparate  but  related  elements  of  learning  content. 

An  example  of  integration  of  instructional  material  is  when 
fragments  of  text  are  directly  embedded  into  a  diagrammatic 
presentation  or  as  close  as  possible  to  corresponding  components 
of a diagram (see Figure 1). As illustrated in Figure 1, the first
diagram (i) presents an example of a conventional split-attention
worked example in geometry. In Figure 1, the diagram that is
above the text outlines the solution to the problem. The diagram
and  text  are  presented  separately.  In  processing  the  information  on 
the  diagram  and  text  below  it,  the  learner  has  to  understand  the 
solution  steps  and  then  link  them  with  the  diagram.  This  requires 
mentally  integrating  the  two  sources  of  information,  drawing  on 
considerable  cognitive  resources  from  the  learner,  contributing  to 
learning  difficulty. 


Problem: Find angle BDE

∠DBE = 80° (vertically opposite angles)

∠BDE + 40° + 80° = 180° (angle sum of a triangle)

∠BDE + 120° = 180°

∠BDE = 60°

(i)


In  the  integrated  worked  example  (see  Figure  1,  second  diagram 
[ii]),  the  learner  can  allocate  working  memory  resources  to  the 
relational  dimensions  of  the  problem,  because  his  or  her  mental 
capacity  is  released  from  the  need  to  search  and  match  the  solution 
steps  and  link  them  with  the  diagram.  The  integrated  example 
enhances  learning  since  it  guides  the  learner  through  the  steps  of  a 
worked  example  (Ayres  &  Sweller,  2005;  Ginns,  2006;  Mayer, 
2009). 

While many instructional materials make use of both a diagrammatic
component and a textual component of information,
which imposes a high demand on working memory, the
most efficient current method of dealing with split-attention is physically
integrating instructions (Agostinho et al., 2013). This manner of
presentation represents a form of instructor-manipulated intervention
(Paas et al., 2010). The argument proposed in this article is
that guided self-management, in which learners are asked to link
text to relevant parts in the diagram, may be an alternative to
instructor-integrated instructional materials. Particularly in the
current educational environment, where learners can access vast
amounts of information, it is likely that learners will often be
confronted with split-attention learning materials without any form
of instructional guidance. In those cases, learners will have to
self-manage learning from the split-attention materials.

Self-Management  Effect 

The self-management effect was recently developed within a
cognitive load framework by Roodenrys and colleagues (2012; see
also Agostinho et al., 2013). The researchers involved noticed the
high variability of instructional formats used on the World Wide
Web and elsewhere and the high likelihood of these materials
being designed without any cognitive load considerations. Broadly,
two options are available to cognitive load theorists. The first is to
keep on reconstructing deficient instructional formats (e.g., split
attention) into more effective formats (e.g., integrated). However,
given the sheer amount of information that now exists electronically,
most of it generated without cognitive considerations, there
are now severe limits on what instructors can realistically achieve
to facilitate learning.

(ii)

Figure 1. Split-attention format (i) and integrated format (ii). Source: Ayres and Sweller (2005, p. 208).

Thus, the field has started to move from simply reconstructing
deficient formats to looking at self-regulated learning as a
means for learners to control their own cognitive load. Although there
is an extensive literature on active integration performed by learners
(see, e.g., Aleven, McLaren, Roll, & Koedinger, 2006;
Berthold & Renkl, 2009; Bjork & Bjork, 2011; Bodemer,
Ploetzner, Feuerlein, & Spada, 2004; Mason, Caterina, &
Pluchino, 2013), the guided self-management strategy that is
investigated in this study specifically focuses on self-management of
cognitive load and is considered a general skill that can be transferred
by learners to other domains. Guided self-management is a
very specific and direct form of instructional guidance designed to
facilitate the skill of self-management of learning. Learners can be
given very specific coaching in identifying inefficient instructional
formats and then given examples of how to self-manage
such load (Agostinho, Tindall-Ford, & Bokosmaty, 2014; Gordon,
Tindall-Ford, Agostinho, & Paas, 2016; Roodenrys et al., 2012;
Tindall-Ford, Agostinho, Bokosmaty, Paas, & Chandler, 2015).
The results of a number of studies have indicated that not only
does this self-management work, but in the long run it may also be
more effective than any other format derived from cognitive load
theory for assisting transfer. The article will now discuss some of
the research on the self-management effect.

Self-management consists of making connections between the
two representations (e.g., text and diagram) through annotations
such as highlighting, underlining, and drawing arrows (Agostinho
et al., 2013, 2014; Roodenrys et al., 2012). Most self-management
studies involve the use of split-attention self-management techniques
with either paper-based materials or in an online environment,
such as drawing arrows, numbering, or moving text to related
diagrammatic components. Self-management in the experiment
conducted in this article required the learner to use paper-based
materials to highlight, underline, and draw arrows to parts of the
diagram that the learner understands are related, the aim being to
reduce learners’ need to search and thus free cognitive resources
for learning.

Confronting learners with conditions of learning that impose
initial challenges (i.e., desirable difficulties) has
been found to enhance retention and transfer performance (Bjork
& Kroll, 2015). For example, creating small discrepancies between
an auditory narration and on-screen text can be desirable. In the
study of Bjork and Kroll (2015), participants studied a lesson about
the life cycle of a star that comprised animation, narration, and
on-screen text. When the narration was a little different from the
on-screen text, participants learned more than when the narration
and text were identical (Bjork & Kroll, 2015; Yue, Bjork, & Bjork,
2013). Paas and Van Merriënboer (1994) also showed that students
who studied high-variability worked examples invested less time
and mental effort in practice, and attained better and less
effort-demanding transfer performance than students who solved
high-variability conventional problems.

Tindall-Ford et al. (2015) examined secondary school students
self-managing split attention when learning about the properties
of angles in mathematics. They found that the students who
received instructor guidance to integrate text with a diagram
performed better on later tests than students who received no such
guidance. Using educational psychology materials, Roodenrys et
al.’s (2012) experiments showed that it is possible to instruct
students on how to self-manage information. In their first experiment,
participants in an integrated group performed significantly
better than the self-management group across recall and far transfer
performance items. For near transfer items, the self-management
group slightly outperformed the integrated group.
Roodenrys et al. (2012) also concluded that self-management
instructions need to be carefully constructed so that the instructions
do not result in unnecessary cognitive load because of
either split-attention or redundancy. Their studies also showed that
the positive effects of self-managed instructions may be demonstrated
on transfer tasks.

Thus, to this point, research on the self-management
effect has shown promising performance gains over the split-attention
format in areas such as educational psychology (Roodenrys
et al., 2012), mathematics (Tindall-Ford et al., 2015), and
educational technology (Agostinho et al., 2013). The current study
investigated whether a guided self-managed group in accounting
would be superior to a split-attention group.

The  Current  Study 

Based on the above discussion, the current study examined
the hypothesized superior learning outcomes of guided self-management
instructions and integrated instructions over conventional
split-attention instructions. The hypotheses regarding
the effects of instructional condition on performance and mental
effort are described next.

Instructional  Design  and  Performance 

1. Performance by split-attention format group (Group 1)
and guided self-managed group (Group 3):

Hypothesis 1a: Students in the guided self-managed format
will outperform students in the split-attention format on recall
tests.

Hypothesis 1b: Students in the guided self-managed format
will outperform students in the split-attention format on
transfer tests.

Within cognitive load research, experimental evidence has
shown that novices learning to solve problems by studying split-attention
material are left with little processing capacity for schema
acquisition and the capability to recall and transfer knowledge
(e.g., Ayres & Sweller, 2005). Most recent research (Agostinho et
al., 2014; Roodenrys et al., 2012; Tindall-Ford et al., 2015) has
revealed that self-managing split-attention problems (such as in
Group 3) results in superior performance as compared with learners
who have not been provided with any guidance.

Instructional  Design  and  Mental  Effort 

Subjective mental effort scores, which are considered a reliable
measure of overall cognitive load (Paas et al., 2003), were collected
in this study. Based on CLT and the empirical findings of
previous studies into the self-management effect, the following
predictions regarding the effects of instructional design on mental
effort were formulated.

The first hypothesis in relation to mental effort was tested in the
first part of the experiment. Participants in Part 1 of the Experiment
were given specific guidance to assist them with moving text
as close as possible to the associated diagram. This is expected to
lead to higher cognitive load compared with the other conditions.
It was predicted that:

Hypothesis  2:  Students  in  the  guided  self-managed  format 
group  (Group  3)  will  report  higher  effort  (cognitive  load)  than 
students  in  the  split-attention  format  group  (Group  1),  and  the 
integrated  format  group  (Group  2). 

Cognitive  load  is  increased  by  the  need  to  mentally  integrate 
various  sources  of  information  (Ayres  &  Sweller,  2005).  Hence  the 
split-attention  format  group  should  report  higher  cognitive  load 
than  the  integrated  format  group.  The  guided  self-management 
group  is  expected  to  report  a  high  cognitive  load  because  of  the 
need  to  move  the  text  as  close  as  possible  to  the  associated 
diagram. 

In  relation  to  the  transfer  task  (Part  2  of  the  Experiment), 
participants  in  the  guided  self-managed  format  (Group  3)  were 
expected  to  use  the  guidance  learned  in  Part  1  of  the  Experiment 
to  move  text  as  close  as  possible  to  the  associated  diagram  without 
specific  guidance  from  the  instructor.  This  is  expected  to  lead  to 
higher  cognitive  load  compared  with  the  other  conditions.  It  was 
predicted  that: 

Hypothesis  3:  Students  in  the  guided  self-managed  format 
(Group  3)  would  use  guidance  to  self-manage  and  report 
higher  cognitive  load  than  students  in  the  integrated  format 
group  (Group  2)  and  split-attention  group  (Group  1). 

Research has shown that learners have the capability of self-managing
instructional materials and perform better on test items
compared with the split-attention group (Agostinho et al., 2014;
Roodenrys et al., 2012; Tindall-Ford et al., 2015). This result has
emerged despite the fact that the guided self-management group is
required to carry out an additional task of moving text as close as
possible to the diagram during the learning phase. The research
conducted to date shows that learners who are taught to self-manage
instructional materials on their own perform better than
learners given the split-attention format (Agostinho et al., 2013;
Roodenrys et al., 2012). Learners who self-managed split attention
performed the same as the integrated group (Roodenrys et al., 2012;
Tindall-Ford et al., 2015). To test the above hypotheses, an
experiment was conducted in the current study.

Method:  Part  1  of  the  Experiment 

The aim of the experiment was to inquire into the split-attention
effect in accounting instructional materials. In addition, the experiment
sought to test whether specific guidance developed to assist
participants with moving text as close as possible to the associated
diagram in split-attention learning materials could lead to higher
learning performance than a traditional split-attention condition.

Participants  and  Design 

The participants were 123 first-year undergraduate students (63
men and 60 women, M = 21 years old, SD = 2.17) from a
Zimbabwean university. Approval for human subjects research
was obtained from the Human Research Ethics Committee at the
Zimbabwean university. A power analysis using the G*Power computer
program (Faul, Erdfelder, Lang, & Buchner, 2007) indicated
that a total sample of 40 people would be needed to detect large
effects (d = .8) with 97% power using a t test between means with
α at .05. Participants were enrolled in 13 degree programs, each
taking an introduction to accounting course. Consenting participants
were randomly assigned to one of the three conditions. There
were 41 students in the split-attention group (Group 1; 22 men and
19 women, M = 22 years old, SD = 2.22), 40 students in the
integrated group (Group 2; 21 men and 19 women, M = 20 years
old, SD = 1.48), and 42 students in the guided self-management
group (Group 3; 20 men and 22 women, M = 21 years old, SD =
2.80). Students participated voluntarily in the study and were not
paid for participation. They had been informed of the study 1 week
before the experiment was conducted.
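
An a priori power calculation of the kind described above can be approximated in a few lines. The following is a minimal sketch only, assuming two independent groups of equal size and a normal approximation to the test statistic; G*Power itself works with the noncentral t distribution, so its figures differ somewhat. The function names `approx_power` and `n_for_power` are hypothetical and are not part of G*Power or of this study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power for comparing two independent group means.

    d is the standardized effect size (Cohen's d). Uses a normal
    approximation, so values are slightly optimistic for small groups.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = d * sqrt(n_per_group / 2)
    return z.cdf(shift - crit)


def n_for_power(d, target, alpha=0.05, two_sided=True):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n, alpha, two_sided) < target:
        n += 1
    return n
```

Calling `approx_power(d, n_per_group)` with the planned effect size and group size gives an approximate power value, and `n_for_power(d, target)` inverts the calculation; whether the test is treated as one- or two-sided is an input that has to be chosen explicitly.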

The experiment was conducted during Week 3 of a 13-week
semester with students studying a first-year accounting core subject.
At the start, before the experiment commenced, the researchers
explained the organization and reasons for the experiment.
Students  were  informed  that  participation  was  voluntary  and  that 
the  results  from  the  experiment  were  not  part  of  the  subject’s 
assessment,  and  that  data  would  be  collected  anonymously.  There 
was  no  incentive  offered  to  the  participants.  Participants  were 
given  participant  information  sheets  and  consent  forms.  They 
signed  the  consent  form  stating  their  written  agreement  to  take  part 
in  the  study.  Students  who  agreed  to  participate  were  randomly 
assigned  to  one  of  the  three  conditions. 

Before responding to the test questions, participants completed a
pretest questionnaire. They answered questions about their
age, gender, first language, and knowledge of accounting. This
took 10 min to complete.

Across the instructional format groups, men made up 48% to 53%
and women 47% to 52% of the 123 participants in both the first and
second study. The dominant language was Shona, spoken by over 93%
of the participants in each instructional
format. Participants’ gender and linguistic homogeneity was thus
apparent across the groups. All students had passed a high school
formal English language examination. The students’ language
proficiency was sufficiently high to respond to questions in English.
Each participant was then tested individually with researchers
supervising the test. Part 1 of the Experiment took 45 min.

Materials  and  Procedure 

The instructional materials explained the basic accounting equation,
the debit and credit rules, and their effect on the basic
accounting equation. The instructional materials were obtained
from an accounting textbook (Weygandt et al., 2010, pp. 53-54) in
the form of split-attention, but formatted as follows for each of the
three conditions:

Group  1-split  attention:  The  instructional  material  used  by 
Group  1  (split-attention  format)  was  similar  to  that  found  in 
the  textbook.  An  example  of  the  material  used  in  the  current 
study  is  illustrated  in  Figure  2. 

Group 2-integrated group: The instructional material in
Group 2 (the integrated format) was presented in a format that
integrated the diagram with the text (see Figure 2). The
content was reformatted to decrease split-attention by bringing
the text as close as possible to the diagram (integrating).
The integrated material was developed after reviewing the
research concerning split-attention (e.g., Agostinho et al.,
2013; Ayres & Sweller, 2005; Roodenrys et al., 2012;
Tindall-Ford et al., 2015) and then reformatting the instructional
material. An example of the material used in the current
study is illustrated in Figure 2.

(i)

Assets

DR    CR
(+)   (-)

Assets

To increase (+) the balance in the asset accounts, you debit by entering the amount on the left hand side.
To decrease (-) the balance you credit by entering the amount on the right hand side.

Debits to a specific asset account should exceed the credits to that account. The normal
balance of an account is on the side where an increase in the account is recorded. Thus
asset accounts normally have debit balances.

(ii)

Assets

DR (+)

To increase (+) the balance in the asset
accounts, you debit by entering the amount
on the left hand side. The normal balance
of an account is on the side where an
increase in the account is recorded. Thus
asset accounts normally have debit balances.
Debits to a specific asset account should
exceed the credits to that account.

Figure 2. Example of conventional split-attention format (i) and integrated
format (ii) in accounting.

Group 3-self-managed cognitive load: Instructional materials
in Group 3 (the guided self-managed format) were developed
in such a way that they assisted participants to integrate the
diagram with the text. An example of the material used in the
current study is illustrated in Figure 3. The material contained
guidance (as shown in Figure 4) such as: (a) Draw a circle
around the information for each debit and credit; (b) Draw an
arrow to link it to its corresponding place on the diagram; (c)
Highlight with a highlighter, or underline, mark circles on key
words, number with a pencil or pen in sequence on the
diagram and on the text. Participants in Group 3 were explicitly
asked to implement the guidance before attempting to
learn the materials. The techniques for self-management were
extensively researched by Roodenrys et al. (2012) and can be
considered the common, current method of self-management
of cognitive load.

The  participants  in  Part  1  of  the  Experiment  worked  manually 
using  pencil  and  paper.  Part  1  of  the  Experiment  had  three  phases: 
learning  phase,  test  phase,  and  post  phase.  At  the  start  of  the 
experiment  participants  completed  a  pretest  questionnaire.  They 
received two A3 pages of learning materials that contained learning
instructions. The learning instructions differed among the three
groups. During the test, as they completed the test questions, they
responded  to  mental  effort  rating  questions.  The  responses  helped 
us  to  evaluate  the  extent  of  recall  of  learning  content  and  transfer 
of  knowledge  by  solving  problems  in  different  situations. 

In the learning phase, the participants were given 15 min to
review the learning materials provided to them. In the final phase,
the researcher administered the test that was formatted as a
single-sided A4 booklet. The test consisted of 28 recall and 11 transfer
items. The participants were given 45 min to complete the test.

An example of a recall question in the test phase is that students
were asked to write the basic accounting equation. Recall questions
required students to retrieve the acquired knowledge (Carpenter,
2012). An example of a transfer question is: In May,
Company X records the transaction by a debit to Accounts Receivable
for $5,000 and a credit to Service Revenues for $5,000.
What is the effect of this entry upon the accounting equation for
Company X? Tick the appropriate answer: Assets: Increase, Decrease,
No Effect; Liabilities: Increase, Decrease, No Effect; Equity:
Increase, Decrease, No Effect.
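
The reasoning that this transfer item calls for can be sketched in code. The following is an illustrative sketch only, assuming the standard double-entry convention that assets increase with debits, that liabilities and equity increase with credits, and that revenue increases equity. The account-to-term mapping and the function names are hypothetical, and the answer the sketch produces follows from those conventions rather than from the study's answer key.

```python
# Hypothetical mapping from account name to the equation term it affects.
ACCOUNT_TERM = {
    "Accounts Receivable": "assets",
    "Service Revenues": "equity",  # assumption: revenue increases equity
}

# Side on which each term of Assets = Liabilities + Equity increases.
INCREASE_SIDE = {"assets": "debit", "liabilities": "credit", "equity": "credit"}


def effect_on_equation(entries):
    """Return the net change per term for a list of (account, side, amount)."""
    change = {"assets": 0, "liabilities": 0, "equity": 0}
    for account, side, amount in entries:
        term = ACCOUNT_TERM[account]
        sign = 1 if side == INCREASE_SIDE[term] else -1
        change[term] += sign * amount
    return change


def describe(change):
    """Translate net changes into the answer options used in the test item."""
    def label(value):
        if value > 0:
            return "Increase"
        if value < 0:
            return "Decrease"
        return "No Effect"
    return {term: label(value) for term, value in change.items()}


entry = [
    ("Accounts Receivable", "debit", 5000),
    ("Service Revenues", "credit", 5000),
]
change = effect_on_equation(entry)
answer = describe(change)
# The equation must still balance after any valid entry.
balanced = change["assets"] == change["liabilities"] + change["equity"]
print(answer, balanced)
```

Under these assumptions the entry increases assets and equity and leaves liabilities unchanged, so the equation stays in balance. Answering the item therefore requires applying the debit and credit rules to a new transaction rather than recalling the equation itself.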

The  transfer  questions  tested  the  ability  to  transfer  acquired 
knowledge,  and  the  demands  of  the  questions  were  higher  than 
recall  questions.  Transfer  questions  required  a  student  to  apply  the 


Figure 3. Example of self-management using arrows. [Figure content: a
T-account for Assets with debit (DR, +) on the left and credit (CR, -)
on the right, accompanied by text such as “To increase (+) the balance
in the asset accounts, you debit by entering the amount on the left
hand side,” “To decrease (-) the balance you credit by entering the
amount on the right hand side,” and “Debits to a specific asset account
should exceed the credits to that account.”]

Self-management Guidance Experiment Part 1 - Group 3
The  specific  steps  below  will  assist  you  to  learn  the  accounting  equation  more 
effectively  by  making  use  of  your  working  memory. 

Please  complete  the  following  tasks  before  you  start  reading  the  material 
presented: 

(a)  Draw  a  circle  around  the  information  for  each  debit  and  credit 

(b)  Draw  an  arrow  to  link  it  to  its  corresponding  place  on  the  diagram.  An 
example  has  been  done  for  you. 

(c)  Highlight  with  a  highlighter,  or  underline,  mark  circles  on  key  words, 
number  with  a  pencil  or  pen  in  sequence  on  the  diagram  and  on  the  text.  An 
example  has  been  done  for  you. 

Figure  4.  Guidance  on  self-management. 


knowledge acquired during instruction to a novel situation (Mayer
& Wittrock, 1996). Participants provided mental effort ratings
after the learning phase and after attempting every question, as
outlined by Paas (1992). Participants wrote answers and any comments
they wished to provide on the blank spaces immediately
below the questions. The researcher collected all the test booklets
soon after the students completed the tasks.

Pilot  Study 

A pilot study was conducted before the experiment. It aimed at
refining the instructional guidance, the instructional content, and the
time allowed for each phase of the studies. Five students from
the same university took part. Those five students did not participate
in the experiment. The time limits for both the learning phase
and the test phase were determined in the pilot study. The time given
to complete the test was strictly controlled to avoid the possibility
of a systematic difference in processing time between the
split-attention, integrated, and self-managed groups. Research has
demonstrated that processing time is positively related to recall
(Barrouillet, Bernardin, Portrat, Vergauwe, & Camos, 2007).

Rating of mental effort. After students completed working
through the instructional materials, they were asked to rate the
cognitive load associated with the learning task. To measure perceived
cognitive load, this study used Paas and Van Merrienboer’s
(1994) 9-point subjective cognitive load rating scale. This is an
established scale to measure the level of overall cognitive load
(Ayres & Paas, 2012; Van Gog & Paas, 2008).

Mental effort ratings were solicited from participants at the
end of the learning phase and after each question in the test. For
example, participants were asked “How much mental effort did you
invest to learn the material?” at the completion of the learning phase
and “How much mental effort did you invest to answer this question?”
at the end of each test question. The ratings of mental effort on the
accounting exercises were assumed to assess cognitive load indirectly
(Paas et al., 2003; Van Gog & Paas, 2008).


Compliance measures. Compliance was an additional measure
included in the analysis for participants allocated to Group 3
(the guided self-managed format) of Part 1 of the Experiment.
Compliance refers to the participant’s use of the guidance attached
to the instructional materials. Evidence of compliance involved
examination of the instructional materials (A3 sheets of paper) to
determine if participants implemented the instructional guidance
(to assist guided self-management). Participants were considered
“compliant” if they highlighted material with a highlighter,
underlined material, or marked circles on key words with a pencil or
pen.

Reliability. In the present studies, the reliability of the scale
was estimated with Cronbach’s coefficient α. The data from the
recall test scores and transfer test scores were entered and run in
SPSS using the reliability analysis function. For the internal
consistency check for the recall test scores and transfer test scores, all
groups were first combined and then separated by treatment group
to ensure that internal consistency for all groups was established.
The combined results for the experiment displayed a high level of
internal consistency. The results showed a recall Cronbach’s α of .946 and
.918 for the transfer task. The transfer Cronbach’s α for the
experiment was .892 and .882 for the transfer task. Overall,
Cronbach’s α for the guided self-management of cognitive load
experiment ranged between .806 and .885 among the three treatment
groups. The recall results ranged between .882 and .946 and the
transfer Cronbach’s α ranged between .806 and .892. Gall et al.
(2003) state that, for research purposes, a reliability of .80
or higher is considered sufficient.
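
As an illustration of the reliability statistic described above, the following is a minimal sketch of Cronbach's α computed from a participants-by-items score table. The study itself used the SPSS reliability analysis function; the function name `cronbach_alpha` and the example scores are hypothetical and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Return Cronbach's alpha for a participants-by-items score table.

    `scores` is a list of rows, one per participant; each row holds that
    participant's score on every test item (all rows the same length).
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two participants")
    # Sample variance of each item (column) across participants.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of each participant's total score.
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical item scores (4 participants x 3 items), not the study's data.
example_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 4],
]
alpha = cronbach_alpha(example_scores)
print(f"alpha = {alpha:.3f}")
```

Values of α closer to 1 indicate that the items vary together across participants, which is the sense in which the reported coefficients above .80 are read as sufficient internal consistency.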

Results of the Experiment

The data were analyzed with one-way analyses of variance
(ANOVAs) with code 1 (i.e., split-attention instruction), 2 (i.e.,
integrated instruction), and 3 (i.e., guided self-managed instruction)
representing the levels of the between-subjects factor instructional
format, to determine its effects on the dependent measures,
performance on recall, transfer, and perceived cognitive load. In
case of significant F tests, pairwise post hoc comparisons using
Tukey contrasts were conducted. The α level was set at .05 (p <
.05) when evaluating tests of statistical significance. To measure
effect size, Cohen’s d was calculated, with values of .10, .30, and
.50 characterizing small, medium, and large effect sizes, respectively
(Cohen, 1988).

Pretest response and analysis. A one-way ANOVA was
conducted on pretest responses to explore differences across the
three groups involved in the experiment. Means and SDs are
shown in Table 1. The one-way ANOVA for pretest questions
revealed no significant main effect of group for age, F(2, 120) =
.01, p = 1, or for knowledge of accounting, F(2, 120) = 1.191, p =
.307. The equivalence of knowledge of accounting between the
groups was important to note, as no group came to the study with
higher a priori knowledge of accounting. This is evidence that the
three groups are equivalent on significant demographic dimensions.
Accordingly, statistically significant differences detected
later are more likely to be caused by differences between the
treatment conditions.

Mental  effort  ratings.  Means  and  SDs  of  mental  effort  ratings 
in  the  learning  phase  are  shown  in  Table  1.  Results  from  the 
one-way  ANOVA  for  mental  effort  invested  in  the  learning  phase 

Table 1

Means and SDs for Pretest Responses, Recall, and Transfer Test
Scores as a Function of Instructional Condition

                                       Part 1            Part 2
Measure                             Mean     SD       Mean     SD

Pretest
 Age
  Split-attention (n = 41)         21.20    2.86
  Integrated (n = 40)              21.20    2.28
  Guided self-managed (n = 42)     21.19    2.79
 Knowledge of accounting (a)
  Split-attention (n = 41)          2.10     .37
  Integrated (n = 40)               1.98     .42
  Guided self-managed (n = 42)      2.02     .27
Recall performance (b)
  Split-attention                  49.93   13.32     51.46   16.84
  Integrated                       59.18   12.33     64.28   15.12
  Guided self-managed              80.00   14.38     87.14   13.63
Transfer performance (c)
  Split-attention                  36.73   16.38     40.71   17.69
  Integrated                       53.30   17.99     61.73   15.63
  Guided self-managed              66.83   16.72     83.21   19.18
Mental effort rating
 Learning phase
  Split-attention                   7.90    1.16      7.17    1.64
  Integrated                        6.80    1.77      4.73    2.04
  Guided self-managed               4.02    1.46      3.62    1.91
 Recall
  Split-attention                   7.49     .98      6.56    1.66
  Integrated                        4.55    1.36      4.63    1.44
  Guided self-managed               2.93    1.33      2.71    1.13
 Transfer
  Split-attention                   5.88     .93      6.05    1.38
  Integrated                        5.00     .94      3.89    1.3
  Guided self-managed               3.52     .99      2.33    1.21

(a) Actual responses were 1 to 5 for knowledge of accounting. (b) Actual
raw score ranges were 0 to 28 for recall. (c) Actual raw score ranges
were 0 to 11 for transfer.



indicated significant differences across the three formats, F(2,
120) = 75.77, p < .05, effect size ηp² = 0.55. There were large and
significant between-groups differences on mean mental effort rating
in the learning phase, which indicated that the guided self-managed
group reported lower levels of cognitive load than the
integrated group. The perceived amount of mental effort invested
in the split-attention format was higher than that invested with the
integrated format.
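
To make the one-way ANOVA concrete, here is a minimal sketch that recomputes the F statistic for the Part 1 learning-phase mental effort ratings from the group sizes, means, and SDs listed in Table 1. The function name `anova_from_summary` is hypothetical, and because the inputs are rounded summary values rather than the raw data, the result should only be expected to approximate the reported F(2, 120) = 75.77.

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from per-group (n, mean, sd) triples."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups sum of squares: recovered from each group's sample SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Part 1 learning-phase mental effort ratings from Table 1: (n, mean, SD).
learning_phase = [
    (41, 7.90, 1.16),  # split-attention
    (40, 6.80, 1.77),  # integrated
    (42, 4.02, 1.46),  # guided self-managed
]
f_stat, df1, df2 = anova_from_summary(learning_phase)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

The degrees of freedom follow directly from the design: three groups give 2 between-groups degrees of freedom, and 123 participants in three groups give 120 within-groups degrees of freedom.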

A Tukey post hoc test for learning phase revealed that the mental
effort factor was statistically significant, with the guided self-managed
group recording the lowest cognitive load (4.02 rating, p < .05)
compared with the integrated format group (6.80 rating, p < .05), d =
1.71, and split-attention format group (7.90 rating, p < .05), d = 2.94.
Tukey post hoc tests also revealed that the integrated format group
reported a significantly lower level of perceived cognitive load compared
with the split-attention format group, d = 0.74, indicating a
large effect size.
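
The pairwise effect sizes can be checked against Table 1 with a short sketch. The text does not state which standardizer was used for Cohen's d; the version below divides the mean difference by the root mean square of the two group SDs, which is an assumption chosen because it appears to reproduce the learning-phase values reported above. The function name `cohens_d` is hypothetical.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the root mean square of the two group SDs.

    Assumption: the source does not name its pooling formula; this
    unweighted form is used here only because it matches the reported d.
    """
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


# Part 1 learning-phase mental effort ratings from Table 1: (mean, SD).
split = (7.90, 1.16)
integrated = (6.80, 1.77)
guided = (4.02, 1.46)

print(f"guided vs. integrated: d = {cohens_d(*guided, *integrated):.2f}")
print(f"guided vs. split:      d = {cohens_d(*guided, *split):.2f}")
print(f"integrated vs. split:  d = {cohens_d(*integrated, *split):.2f}")
```

A sample-size-weighted pooled SD would give nearly the same values here because the three groups are almost equal in size.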

Performance measures. Table 1 shows means and SDs for
performance measures in the experiment based on one-way ANOVAs.
A one-way ANOVA for recall scores showed a significant
main effect for the recall test items, F(2, 120) = 54.834, p < .05,
effect size ηp² = 0.478. Mean recall and transfer scores showed that
the guided self-managed group had higher scores than the integrated
group, which in turn had higher scores than the split-attention
group. Consistent with predictions, post hoc comparisons
using Tukey contrasts showed that the guided self-managed group
performed significantly better than the split-attention group, d =
2.17, and integrated group, d = 1.45, with the integrated group
performing better than the split-attention group, d = 0.73, indicating
a large effect size.

The one-way ANOVA for transfer questions also demonstrated
a significant main effect of group, F(2, 120) = 32.478, p < .05,
effect size ηp² = 0.351. Post hoc comparisons using Tukey
contrasts showed that the guided self-managed group performed
significantly better than the split-attention group, d = 1.82, and the
integrated group, d = 0.78. Again the integrated group performed
better than the conventional split-attention group, d = 0.96.
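
The effect sizes reported with the F tests appear consistent with partial eta squared, which can be recovered from an F statistic and its degrees of freedom. The sketch below shows that conversion under this assumption; the function name `partial_eta_squared` is hypothetical.

```python
def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


# F values reported for Part 1 recall and transfer performance.
recall_effect = partial_eta_squared(54.834, 2, 120)
transfer_effect = partial_eta_squared(32.478, 2, 120)
print(f"recall:   {recall_effect:.3f}")
print(f"transfer: {transfer_effect:.3f}")
```

In a one-way design with a single between-subjects factor, partial eta squared and eta squared coincide, so either label describes the proportion of variance in the scores associated with instructional format.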

Mental effort rating on instruction. After the learning phase
and after each test question, students were asked to rate their
perceived mental effort on recall and transfer questions.
One-way ANOVAs were conducted to determine the influence of
instructional methods on mental effort ratings for the recall and
transfer test items.

Table 1 shows the mean ratings and SDs for the ratings in the
test phase. There were large and significant differences arising
from the different instructional formats based on mean values of
recall results, F(2, 120) = 144.973, p < .05, effect size ηp² =
0.707. Instructional format also differentially affected the mean
values of transfer results, F(2, 120) = 64.834, p < .05, effect size
ηp² = 0.519. The perceived cognitive load was significantly lower
in the integrated group than in the split-attention group. Contrary
to expectations, the guided self-managed group reported significantly
lower levels of perceived cognitive load than the integrated
group.

Follow-up Tukey post hoc tests for recall on the differences
revealed that mental effort was statistically significant with the
guided self-managed group recording the lowest cognitive load
(2.93 rating, p < .05) compared with the integrated group (4.55
rating, p < .05), d = 1.21, and split-attention group (7.49 rating,
p < .05), d = 3.90. Tukey post hoc tests also revealed that the
integrated group (4.55 rating, p < .05) reported a significantly
lower level of perceived cognitive load compared with the
split-attention group (7.49 rating, p < .05), d = 1.32. Similarly, a Tukey
post hoc test for transfer items revealed that the cognitive load
differed significantly between groups, with the guided self-managed
group recording the lowest cognitive load (3.52 rating,
p < .05) compared with the integrated group (5.00 rating, p < .05),
d = 1.53, and the split-attention group (5.88 rating, p < .05), d =
2.45. A Tukey post hoc test also revealed that the integrated group
(5.00 rating, p < .05) reported a significantly lower level of
cognitive load than the conventional split-attention group (5.88
rating, p < .05), d = 0.93.

Guidance compliance. Results of the compliance measures
indicated that 95% of participants followed the guidance about how
to self-manage split attention. Compliance referred to the
participant’s use of the guidance attached to the instructional materials
for Group 3. Participants were considered “compliant” if they
performed at least one of the tasks suggested, which were
highlighting material with a highlighter, using arrows to link text and
diagram, underlining material, or drawing circles to mark key
words with a pencil or pen.

The most common strategy used by the participants was
highlighting, underlining, or numbering (86%). The second most used
strategy was drawing an arrow to link text to its corresponding place
on the diagram. Only 36% of the participants drew a circle around
the information. The use of these tasks appears to have been
useful for understanding the instructional material. The level at
which these tasks were conducted suggests that full utilization of
the guidance contributed to higher performance scores.

Part  2  of  the  Experiment 

The aim of the transfer task was to re-examine the existence of
split attention with new learning materials and to test the transfer of
guided self-management skills to a new learning domain. Part 2
of the experiment used the same procedure and the same participants
as Part 1 to test the robustness of a possible “self-management
effect.” The instructional materials were changed, and learners
were no longer specifically instructed to follow a certain
procedure.

Materials  and  Procedure 

The instructional materials were about the topic of ratio analysis.
In the test phase there were three pages of recall and transfer
test questions to be answered, including a requirement to rate
mental effort after answering every test question. The first set of
instructional materials was taken directly from the textbook
(Weygandt et al., 2010, pp. 783-785). This constituted the
split-attention format. The second set of instructional materials was
adjusted to reflect the integrated format, and for the third set, the
guided self-management format, no guidance was given. The
students in the guided self-management format group were expected
to utilize the guided self-management techniques they
gained from Part 1 of the Experiment. Part 2 of the experiment
proceeded directly after Part 1.

Rating  of  mental  effort.  Similar  to  Part  1  of  the  experiment, 
a  9-point  subjective  cognitive  load  rating  scale  was  used. 

Compliance measures. Compliance measures involved analysis
of the participant’s use of highlighting, underlining, or marking
circles on key words with a pencil. Evidence of compliance
involved examination of the instructional materials to determine if
participants implemented the guidance.

Results  of  Part  2  of  the  Experiment 

Mental  effort  ratings  in  the  learning  phase.  Means  and  SDs 
of  mental  effort  ratings  for  learning  phase  are  shown  in  Table  1. 
Results  from  the  one-way  ANOVA  for  mental  effort  invested  in 
the  learning  phase  indicated  significant  differences  across  the  three 
formats, F(2, 120) = 39.04, p < .05, effect size ηp² = 0.394.
Consistent  with  predictions,  there  were  large  and  significant 
between-groups  differences  on  mean  mental  effort  rating  in  the 
learning  phase.  Mean  learning  phase  ratings  showed  that  the 
guided  self-managed  group  reported  lower  levels  of  cognitive  load 
than  the  integrated  group.  The  perceived  amount  of  mental  effort 
invested  by  the  split-attention  group  was  higher  than  that  invested 
by  the  integrated  group. 

Tukey post hoc tests for the learning phase revealed that mental
effort differed significantly between the groups, with the guided
self-managed group recording the lowest cognitive load (3.62
rating, p < .05) compared with the integrated group (4.73 rating,
p < .05), d = 0.55, and the split-attention group (7.17 rating, p <
.05), d = 1.99. Tukey post hoc tests also revealed that the integrated
group reported a significantly lower level of perceived
cognitive load than the split-attention group, d = 1.32, indicating
a large effect size.

Performance  measures.  Two  separate  one-way  ANOVAs 
were  conducted  on  recall  and  transfer  test  performance  scores  to 
explore  differences  between  the  three  groups  involved  in  the 
transfer  task.  Means  and  SDs  for  recall  and  transfer  test  scores  are 
shown  in  Table  1. 

A one-way ANOVA for recall scores revealed a significant
effect of group, F(2, 120) = 63.825, p < .05, effect size ηp² =
0.515. Mean recall and transfer scores showed that the guided
self-managed  group  had  higher  scores  than  the  integrated  group, 
which  also  had  higher  scores  than  the  split-attention  group.  Post 
hoc  comparisons  using  Tukey  contrasts  showed  that  the  guided 
self-managed  group  performed  significantly  better  than  the  other 
two  groups.  The  integrated  group  performed  significantly  better 
than  the  split-attention  group. 

The one-way ANOVA for transfer questions also demonstrated
a significant main effect of group, F(2, 120) = 60.721, p < .05,
effect size ηp² = 0.503. Post hoc comparisons using Tukey contrasts
showed that the guided self-managed group outperformed
both the split-attention and integrated groups. The integrated group
also performed significantly better than the split-attention group.

Mental effort rating on the test. A one-way ANOVA was
conducted on the instructional rating (of mental effort required)
that the participants were asked to provide after the learning phase
and after answering every question. The means and SDs for recall
and transfer mental effort rating for the test phase are shown in
Table 1. Results indicated a significant effect of group on recall
items, F(2, 120) = 75.477, p < .05, effect size ηp² = 0.557. Mean
recall and transfer ratings showed that the guided self-managed
group reported the lowest levels of cognitive load, followed by the
integrated group and the split-attention group.

Post hoc comparisons using Tukey contrasts showed that the
guided self-managed group reported a significantly lower mental
effort (2.71 rating, p < .05) than the split-attention group (6.56
rating, p < .05), d = 2.47, and integrated group (4.63 rating, p <
.05), d = 1.71. In turn, the integrated group reported significantly
lower mental effort than the split-attention group, d = 0.80,
indicating a large effect size.

A one-way ANOVA for transfer test items revealed a significant
main effect of group, F(2, 120) = 85.925, p < .05, effect size ηp² =
0.589. Post hoc comparisons using Tukey contrasts showed that
the guided self-managed group reported lower mental effort (2.33
rating, p < .05) than the split-attention group (6.05 rating, p <
.05), d = 2.86, and the integrated group (3.89 rating, p < .05), d =
1.24. The integrated group (3.89 rating, p < .05) reported significantly
lower mental effort than the split-attention group (6.05
rating, p < .05), d = 1.61.

Guidance compliance. Results of the compliance measures
for the transfer task indicated that 41 of the 42 participants (98%)
followed the guidance about how to self-manage split attention.
Participants were considered compliant if they performed at least
one of the tasks provided, which were highlighting material with a
highlighter, using arrows to link text and diagram, underlining
material, or marking circles on key words with a pencil or pen.

The most common strategy used by the participants (90%) was
highlighting, underlining, or numbering. The second most used
strategy was drawing an arrow to link text to its corresponding place
on the diagram. Only 43% of the participants drew a circle around
the information. The use of these tasks appears to have been
useful for understanding the instructional material. The level at
which these tasks were conducted suggests that full utilization of
the guidance contributed to higher performance scores.

Summary  of  Part  2  of  the  Experiment 

The finding of higher transfer performance scores by students in
the guided self-managed group compared with those in the
conventional split-attention group was clearly evident. The superiority
of the guided self-managed group might have resulted from the
implementation of the guidance on how to integrate text and
diagrams before learning the instructional material. The requirement
for the split-attention group, which had no guidance, to mentally
integrate text with relevant aspects of the diagram might
have contributed to that group's poor performance.
The results of performance scores on transfer items are similar to
those found by Roodenrys et al. (2012) and Tindall-Ford et al.
(2015). Such superior performance had been demonstrated, for
example by Roodenrys et al. (2012), with Australian students
studying educational psychology, by showing slightly increased
accuracy of students in the guided self-management group over the
integrated group with transfer test items.

For compliance measures, more than 94% followed the guidance
offered. The performance of the guided self-management
group may be attributed to the guidance given during the learning
phase. The guidance to the self-management group improved
performance across the two performance measures, recall and
transfer, and was accompanied by low reported cognitive load.

Part 2 of the Experiment was designed to follow up on the
results observed in Part 1 of the Experiment and test whether
participants would spontaneously transfer self-management skills
to new and different split-attention instructional materials. If there
is skills transfer, would this lead to a reduction in extraneous load
and therefore enhance performance? Another important question is
whether, if skills transfer occurred, this can be termed a
self-management strategy. We contend that if learners are able to
remember a skill and remember when to use it on their own, initial
instruction was successful in automatizing the self-management
skills.

Students in the guided self-management group demonstrated
higher performance than students in the split-attention group for both
recall and transfer tasks. These results again demonstrate the
robustness of the split-attention effect. The participants in the
guided self-managed group self-managed before attempting to
learn the materials, which improved their performance on test items.
Participants in the integrated and split-attention groups, who
learned the same material but had no self-management knowledge,
performed worse than the guided self-managed group across all
performance measures. The split-attention effect is further shown
by the results of the integrated group, which had higher performance
scores than the split-attention group.

With regard to cognitive load in the transfer task, students in the
guided self-managed instructional group, contrary to expectations,
reported lower perceived cognitive load than students in the
integrated group. In turn, students in the integrated group reported
lower perceived cognitive load than students in the split-attention
group. Apparently, the processes required to work during the test
phase demanded different amounts of mental effort in all conditions.
When the data are differentiated for recall and transfer, the
results still reveal the same tendency, with the split-attention
and integrated groups reporting higher levels of cognitive
load.

The results of cognitive load do not support the hypothesis that
a cognitive structure resulting from guided self-management
instruction improves learning over one resulting from instruction
emphasizing a conventional split-attention format. The high
cognitive load experienced by the split-attention group resulted in the
group investing less effort in more relevant learning processes,
consequently performing poorly in recall and transfer tasks.

General  Discussion 

The aim of this study was to investigate the effect of novice
students’ guided self-management, by physically manipulating
paper-based instructional materials, on learning accounting. This
was examined in Part 2 of the Experiment with different accounting
materials.

The major finding from this study relates to the students’ ability
to learn to manage cognitive load created by instructional material
that requires them to split their attention between diagram and text.
As a precursor to demonstrating the self-management skills, it was
necessary to demonstrate that the accounting instructional materials
do indeed have such a split-source format and that this
has a negative effect on learning. Both parts of the experiment
presented in this study showed that when split attention was managed
by the students by integrating text and diagrams, students consistently
outperformed those in the split-attention and integrated groups.

In terms of test performances, the findings of the experiment
strongly supported the hypothesis that students learn better when
guided to self-manage instructional material rather than learning
content under split attention. Therefore, the evidence of split
attention within the learning materials was established. The integrated
format group performed significantly better than the traditional
split-attention format group in terms of recall and transfer
performance tests. Findings showed that the guided self-managed
format outperformed the integrated format on recall and transfer
test items. In practice students are often confronted with
split-attention materials. If they are able to integrate them, they can
learn more. The students who have learned to integrate can do well
on these tasks, whereas students who have worked with integrated
examples do not know what to do.

In terms of mental effort (perceived cognitive load) experienced
during learning, there were significant differences across the three
groups. Participants studying the guided self-managed format
consistently reported lower cognitive load than those studying the
split-attention format. Our hypothesis was not confirmed, as students
studying under the guided self-managed format reported significantly lower
cognitive load than students studying under the integrated format.
Finally, students in the integrated format group reported
lower levels of cognitive load than students in the split-attention
format group. A possible explanation of the lower cognitive load
in the guided self-managed group is that the guidance provided
by the self-management prompts may have affected the
overall perception of cognitive load. Future studies could investigate
the contribution of the different instructional components to
the students’ perceived cognitive load. The recently developed,
more detailed rating scales of Leppink, Paas, Van der Vleuten, Van
Gog, and Van Merrienboer (2013; see also Leppink, Paas, Van
Gog, Van der Vleuten, & Van Merrienboer, 2014) could be used
to disentangle these contributions.

The transfer of skills through self-management load techniques is a
promising aspect that will enhance learners’ performance in learning
accounting. The present study reinforces the importance for instructors
not just to design material according to CLT principles, but to
present instructional formats in a way students can easily navigate.

A  possible  concern  of  the  current  research  was  the  difference 
between  the  participants  in  the  group  taught  to  self-manage  split 
attention  and  the  other  two  groups.  In  Part  1  of  the  Experiment  the 
students  in  the  self-management  group  were  guided  on  how  to 
move  text  close  to  a  diagram.  In  Part  2  of  the  Experiment  the 
students  in  the  guided  self-management  group  were  expected  to 
use  this  guidance  before  answering  the  questions.  It  is  possible  that 
this  difference  in  instruction  contributed  to  the  difference  in  per¬ 
formance  between  the  groups. 

It  should  be  noted  that  we  used  an  overall  measure  of  cognitive 
load,  which  is  of  restricted  informational  value,  because  it  does  not 
allow  the  experimenter  to  draw  conclusions  about  the  different  types  of  cognitive
load.  Future  studies  could  use  the  more  detailed  measures  developed
by  Leppink  et  al.  (2013;  see  also  Leppink  et  al.,  2014).  Another
limitation  of  this  study  is  that  a  task  may  feel  difficult  for  a  variety  of 
reasons  other  than  cognitive  load,  such  as  difficulty  in  finding  and 
combining  the  appropriate  or  selected  strategies. 

Implications  for  Instructors  and  Learners 

In  formulating  instructional  prescriptions  from  the  present  re¬ 
search,  we  took  into  consideration  previous  findings  such  as  those 
by  Roodenrys  et  al.  (2012),  who  concluded  that  managing  the 
split-attention  format  enhances  performance.  Roodenrys  et  al. 
(2012)  also  revealed  the  potential  of  students  enhancing  their 
performance  by  self-management  of  diagrams  and  text.  Other
studies  have  reported  the  deleterious  effect  of  split  attention
on  learning  (e.g.,  Roodenrys  et  al.,  2012;  Tindall-Ford  et  al.,
2015).  It  is  important  to  note  that  in  the  current  study,  the  benefit 
of  integrated  material  was  even  more  apparent  when  participants 
were  instructed  to  integrate  the  materials  for  themselves. 

In  light  of  these  recent  findings,  the  results  of  the  present  study 
support  the  conclusion  that  instruction  with  emphasis  on  self¬ 
management  of  the  instructional  material  is  an  appropriate  alter¬ 
native  to  conventional  split-attention  instruction.  At  the  same  time, 
caution  is  needed  when  considering  this  strategy  since  several 
studies  have  concluded  that  the  self-management  strategy  has  to  be 
carefully  implemented  to  enhance  learning  (e.g.,  Roodenrys  et  al., 
2012;  Tindall-Ford  et  al.,  2015).  For  example,  guidance  on  self¬ 
management  has  to  be  provided  for  instructional  materials  that  are 
designed  to  evoke  split-attention  that  cannot  be  understood  in 
isolation.  The  guided  self-management  techniques  learners  may 
utilize  involve  numbering,  linking  text  with  diagrams  using  ar¬ 
rows,  or  highlighting  keywords  or  concepts. 

Many  of  the  learning  activities  that  novice  undergraduate  account¬ 
ing  students  engage  with  in  the  classroom,  whether  related  to  reading, 
calculations,  or  other  areas  of  studying  accounting,  impose  consider¬ 
able  burdens  on  the  limited  capacity  of  working  memory.  These 
activities  often  require  a  student  to  hold  in  mind  some  information 
(e.g.,  a  text)  while  attempting  to  match  with  relevant  parts  of  a 
diagram.  This  is  something  that  cognitive  load  theory  and  this  study 
argue  to  be  mentally  challenging  and  not  facilitating  learning,  yet 
these  are  the  kinds  of  activities  that  novice  learners  struggle  with. 
Therefore,  instructors  may  need  to  design  instructional  materials  that 
are  already  integrated  or,  maybe,  guide  students  with  crucial  infor¬ 
mation  to  self-manage  by  properly  integrating  relevant  study  material 
to  facilitate  learning  as  presented  in  this  study. 

While  this  study  enables  us  to  suggest  ways  in  which  instructors 
can  help  learners  achieve  greater  success  early  in  their  accounting 
undergraduate  courses,  it  may  also  assist  learners  to  solve  prob¬ 
lems  by  manipulating  instructional  materials  by  themselves.  Stu¬ 
dents  can  be  taught  how  to  navigate  through  accounting  lectures, 
studying  for  examinations  and  various  other  learning  activities 
using  guided  self-management  skills. 

Instructional  Implications  for  Textbook  Writers 

Numerous  examples  exist  of  instructional  material  not  designed 
according  to  cognitive  load  theory  principles  in  the  area  of  intro¬ 
ductory  accounting.  As  illustrated  in  this  study,  students  often 
encounter  instructional  material  such  as  the  statement  of  financial 
position  (balance  sheet),  statement  of  cash  flows,  and  statement  of 
changes  in  equity.  The  general  format  involves  a  diagrammatic 
representation  with  text  below  or  above  the  diagrammatic  repre¬ 
sentation.  An  alternative  instructional  presentation  would  be  to 
have  the  text  embedded  in  the  diagrammatic  presentation  that  is 
referred  to  as  integrated;  and,  as  this  study  has  shown,  this  would 
reduce  the  extraneous  cognitive  load  and  enhance  learning. 

Another  conventional  way  of  presentation,  again  using  the  example 
of  a  balance  sheet,  is  to  visualize  the  balance  sheet  in  the  form  of  an 
equation.  The  equation  explained  within  the  text  would  be  that  total 
assets  equals  liabilities  plus  owners’  equity.  Looking  at  the  equation  in 
this  way  shows  how  assets  were  financed;  either  by  borrowing  money 
(liability)  or  by  using  the  owners’  money  (owners’  or  shareholders’ 
equity).  However,  most  balance  sheets  do  not  have  the  equation 
depicted,  and  students  are  usually  forced  to  hold  a  mental  representation
of  the  equation  in  their  minds  as  they  try  to  make  sense  of  the
assets  in  one  section  and  the  liabilities  and  net  worth  in  the  other
section  that  would  make  the  sections  "balance."  This  type  of  presentation
places  an  unnecessary  load  on  working  memory.

Conclusion 

In  conclusion,  we  think  that  the  results  of  our  work  are  prom¬ 
ising  for  an  underdeveloped  area  of  “guided  self-management 
effect.”  Like  Roodenrys  et  al.  (2012),  we
found  that  split  attention  has  a  negative  effect  on  learning.  In
addition,  we  found  that  learning  is  enhanced  when  learners  self-manage
instructional  material.  Theoretically,  this  has  important
implications  with  regard  to  evidence  that  learners  can  be  instructed 
on  how  to  self-manage  rather  than  relying  on  the  instructor  to  keep 
on  reorganizing  deficient  instructional  formats  such  as  split  atten¬ 
tion  formats  into  more  effective  formats  such  as  integrated  for¬ 
mats.  From  a  practical  perspective,  these  results  are  important  for 
students,  instructors,  and  textbook  writers.  Students  who  frequently
encounter  split-attention  learning  material  can  take
control  of  their  own  cognition  and  learning.  The  results  of  this
study  reinforce  the  importance  of  instructors  presenting  instructional
formats  in  a  way  that  students  can  easily  navigate  for  guided
self-management.  Despite  the  potential  revealed  by  these  studies,
for  guided  self-management  to  be  successful,  the  onus  still  rests 
with  the  instructor  to  guide  students  to  manage  the  load.  At  the
same  time,  the  research  raises  new  questions.  Further  research  is
needed  to  establish  the  robustness  of  the  student-led
guided  self-management  effect  and  to  find  other  methods  students
can  be  instructed  to  use  to  integrate  separate  text  and  diagrams  to
enhance  learning.

Learning  to  Read  With  and  Without  Feedback,  In  and  Out  of  Context


Sandra  Martin-Chang 

Concordia  University 


The  self-teaching  hypothesis  posits  that  enduring  orthographic  and  phonological  representations  are 
produced  when  children  independently  recode  print  into  speech.  However,  very  little  research  has 
examined  how  children  self-teach  when  initial  decoding  attempts  are  weak  or  ineffective.  In  this 
within-participant  design,  25  students  in  Grade  2  learned  to  read  85  different  words  in  4  conditions. 
Words  were  read  in  and  out  of  context,  with  and  without  feedback.  Accuracy  rates  were  recorded 
throughout  5  training  sessions  (2  word  repetitions  per  session  =  10  repetitions  in  total).  A  posttest  was 
administered  after  a  6-day  delay  by  reinstating  the  training  materials.  At  the  end  of  training,  the  highest 
accuracy  scores  were  observed  when  children  read  in  context/feedback  followed  by  when  they  read  in 
isolation/feedback,  and  then  in  context/no  feedback;  the  lowest  accuracy  scores  were  observed  when 
children  read  in  isolation/no  feedback.  This  pattern  remained  over  the  retention  period,  suggesting  that 
external  support  from  feedback,  and  top-down  support  from  context,  can  help  create  word  representations 
in  memory.  The  results  are  discussed  in  relation  to  the  importance  of  whole-word  phonology  within
self-teaching. 

Keywords:  context,  decoding,  feedback,  isolation,  self-teaching,  word  reading 


Share  (2004)  noted  that  children  “self-teach”  the  majority  of  the 
words  they  can  read.  The  ability  to  self-teach  is  associated  with 
two  potential  factors.  The  first  factor  relates  to  the  process  of 
recoding;  namely,  focusing  on  grapheme-to-phoneme  correspon¬ 
dences  during  decoding  may  help  to  create  well-specified  ortho¬ 
graphic  representations  in  memory  (Share,  1995).  The  second 
factor  relates  to  the  product  of  recoding;  in  this  case,  focusing 
simultaneously  on  whole-word  orthography  and  phonology  may
help  amalgamate  written  words  with  their  spoken  pronunciations 
(Elbro,  de  Jong,  Houter,  &  Nielsen,  2012).  When  decoding  is 
successful,  it  is  difficult  to  disentangle  the  effects  of  these  two 
factors:  proficient  decoding  results  in  the  word’s  correct  pronun¬ 
ciation.  However,  ineffective  decoding  creates  the  opportunity  to 
examine  the  second  possible  factor  more  closely.  For  example, 
when  decoding  skills  are  weak,  or  the  word  to  be  read  is  difficult, 
correct  pronunciations  are  more  likely  to  be  activated  when  chil¬ 
dren  read  in  context.  Moreover,  when  decoding  fails  to  produce  the 
spoken  word  altogether,  correct  pronunciations  can  be  provided 
via  feedback  from  a  “teacher.”  The  current  experiment  explored 
the  second  potential  factor  involved  in  self-teaching  by  examining 
children’s  growth  in  reading  accuracy  as  they  read  in,  and  out,  of 
context — with,  and  without,  corrective  feedback. 

The  Self-Teaching  Hypothesis

The  notion  behind  self-teaching  is  that  children  build  up  “sight 
word”  lexicons — words  that  can  be  read  automatically — by  creat¬ 
ing  orthographic  representations  as  they  decode  words.  In  essence, 
“sounding  out”  new  words  induces  a  form  of  cognitive  processing 
akin  to  “focal  attention”  (Samuels,  1967)  because  children’s  at¬ 
tention  is  focused  on  the  letters,  letter  patterns,  and  letter  sequenc¬ 
ing  that  make  up  each  word’s  unique  orthographic  form  (Share, 
1999).  This  process  allows  children  to  gradually  accumulate  gen¬ 
eral  orthographic  knowledge  (knowledge  about  the  language  as  a 
whole),  by  internalizing  orthographic  representations  of  many 
specific  words  (e.g.,  slowly  coming  to  understand  that  the  “ss”
spelling  pattern  is  legal  at  the  end,  but  not  the  beginning,  of  English
words  through  exposure  to  words  such  as  less,  sell,  boss,
and  sob).  Support  for  this  hypothesis  has  come  from  several
empirical  investigations  showing  robust  word-specific  ortho¬ 
graphic  gains  after  a  minimal  number  of  successful  reading  expe¬ 
riences  for  both  nonwords  (Bowey  &  Muller,  2005;  Cunningham, 
Perry,  Stanovich,  &  Share,  2002;  Ouellette,  2010;  Ouellette  & 
Fraser,  2009;  Share,  1999;  Wang,  Castles,  Nickels,  &  Nation, 
2011)  and,  to  a  lesser  extent,  real  words  (Cunningham,  2006; 
Landi,  Perfetti,  Bolger,  Dunlap,  &  Foorman,  2006). 

Ziegler,  Perry,  and  Zorzi  (2014)  recently  simulated  the  self¬ 
teaching  process  by  building  a  connectionist  model  around  two 
main  assumptions:  (a)  that  young  children  begin  to  read  with  a 
well-developed  spoken  vocabulary;  and  (b)  that  before  reading 
begins  in  earnest,  children  are  explicitly  taught  phonics.  The 
Phonological  Decoding  Self-Teaching  Model  quickly  learned  how 
to  read  more  than  25,000  words  and  was  able  to  generalize  this 
learning  to  read  nonwords.  Critical  to  the  self-teaching  hypothesis, 
it  was  able  to  accomplish  these  tasks  without  the  use  of  feedback 
from  a  teacher.  Interestingly,  context  was  also  seen  to  play  a  key
role  in  the  model.  For  example,  when  the  correct  word  was 
generated  as  one  of  several  possible  word  candidates,  it  was 
chosen  based  on  the  premise  that  in  “real  learning  situations  with
real  texts,  children  have  additional  information  from  the  story 
context,  semantics,  or  syntax  to  help  them  choose  the  correct 
target”  (p.  6).  In  short,  given  alternative  pronunciations,  Ziegler  et 
al.  suggest  that  the  ability  to  select  the  correct  pronunciation  is
attributable  to  additional  information  gained  from  reading  in  context.

Self-Teaching  in  Context 

The  self-teaching  hypothesis  is  aligned  with  decades  of  research 
showing  the  importance  of  explicit  and  systematic  teaching  of 
letter-sound  correspondences  (Snowling  &  Hulme,  2011).  Indeed, 
a  core  tenet  of  self-teaching  is  that  children  need  to  understand  the 
alphabetic  principle  so  that  decoding  can  take  place.  The  self¬ 
teaching  hypothesis  also  acknowledges  that  most  word  knowledge 
is  learned  implicitly  via  experience  with  texts  (e.g.,  Elgort,  Perfetti, 
Rickles,  &  Stafura,  2015;  Nagy,  Anderson,  &  Herman,  1987).

Words  in  naturalistic  text  are  difficult  to  predict  (Gough,  Alford, 
&  Holley-Wilcox,  1981);  therefore,  when  children  read  authentic 
texts  they  seldom  rely  on  guessing  from  context  as  their  primary 
strategy  for  reading  (Nation  &  Snowling,  1998;  Tunmer  &  Chap¬ 
man,  1995).  Still,  under  certain  conditions  predicting  words  from 
context  becomes  more  likely.  For  example,  when  reading  highly 
constrained  texts  (e.g.,  “Roses  are  red,  violets  are  _”),  individuals 
are  less  likely  to  fixate  on  the  predictable  words.  Moreover,  when 
the  predictable  words  are  fixated,  the  amount  of  time  spent  looking
at  them  tends  to  be  shorter  in  duration  compared  with  those  that  are 
less  predictable  (Ehrlich  &  Rayner,  1981).  Under  normal  circum¬ 
stances,  however,  rather  than  simply  “guessing”  from  context, 
children  use  both  partial  decoding  attempts  and  the  structure  of  the 
text  to  arrive  at  correct  word  pronunciations  (Nation  &  Snowling, 
1998;  Share,  2004;  Tunmer  &  Chapman,  2012).  The  advantage 
gained  from  this  type  of  crosschecking  between  semantics  and 
print  has  been  termed  “contextual  facilitation.” 

The  beneficial  effects  of  reading  in  context  have  been  well 
documented  (Martin-Chang  &  Levy,  2006;  Martin-Chang,  Levy, 
&  O’Neil,  2007;  Roth  &  Perfetti,  1980;  Stanovich,  Nathan,  West, 
&  Vala-Rossi,  1985).  What  remains  contentious  is  whether  the 
performance  gains  observed  during  online  contextual  reading
accumulate  to  result  in  crystallized  learning  that  is  capable  of
supporting  later  reading  fluency  (for  the  distinction  between  online 
performance  and  crystallized  learning  see  Byrne  et  al.,  2013). 

Landi  and  colleagues  (2006)  were  interested  in  this  distinction 
between  reading  performance  and  generalized  learning.  They  iden¬ 
tified  two  written  word  sets  that  children  in  Grades  1  and  2 
(Experiment  1)  were  unable  to  accurately  read  aloud.  Half  of  the 
words  were  then  shown  in  isolation,  and  the  other  half  were 
presented  in  predictable  sentences.  Landi  et  al.  reported  that  children
were  able  to  read  more  words  in  predictable  sentences  than  in
isolation  (13.29  words  in  context  compared  with  4.88  words  in 
isolation).  However,  both  sets  of  words  were  read  with  similar 
accuracy  when  presented  in  isolation  one  week  later  (6.15  words  in 
context,  5.3  words  in  isolation).  This  led  Landi  et  al.  to  conclude 
that  children  performed  better  in  context  initially,  but  that  reading 
words  in  isolation  was  superior  for  learning.  However,  as  de¬ 
scribed  previously,  very  highly  constrained  text,  which  is  generally 
not  representative  of  children’s  more  naturalistic  contextual  read¬ 
ing,  encourages  top-down  processing  and  reduces  the  need  for 
students  to  focus  on  the  print.  Therefore,  Landi  et  al.’s  selection  of
text  may  have  inadvertently  reduced  the  opportunities  for
self-teaching  in  context.

Cunningham  (2006)  addressed  some  of
these  issues  by  inviting  children  in  Grade  1  to  read  four  coherent 
and  four  scrambled  passages  that  were  longer  and  less  predictable 
than  the  materials  used  by  Landi  et  al.  (2006).  Cunningham  re¬ 
ported  that  words  were  read  more  accurately  when  they  were 
initially  presented  in  a  meaningful  context  (83.6%)  compared  with 
in  a  scrambled  passage  (67%).  However,  3  days  after  self-teaching, 
the  children  performed  similarly  on  an  orthographic  choice  task 
and  a  spelling  task,  regardless  of  whether  the  words  were  first  read 
in  context  or  isolation.  A  posttest  measure  of  reading  accuracy  was 
not  included — perhaps  because  there  were  only  eight  words  to  be 
learned  in  total  (one  word  per  passage).  Therefore,  the  results  of 
this  study  cannot  be  directly  compared  with  those  of  Landi  et  al.  (2006)
in  terms  of  word  reading  accuracy.

Self-teaching  in  and  out  of  context  was  also  explored  in  two 
experiments  by  Wang  et  al.  (2011).  The  authors  created  a  set  of 
words  that  were  understood  verbally  by  associating  nonwords  with 
meanings.  After  the  pronunciations  and  definitions  of  the  non¬ 
words  were  well  understood  orally,  they  were  presented  in  writing. 
The  spellings  of  the  nonwords  remained  the  same  over  two  experiments;
however,  the  pronunciations  associated  with  the  nonwords
differed.  In  Experiment  1,  the  pronunciations  were  regular,  whereas  in
Experiment  2,  they  were  irregular.  Wang  et  al.  reported  that  the
regular  words  were  read  more  accurately  in  context  during  the  first 
self-teaching  trial  and  that  the  irregular  words  were  read  more 
accurately  across  all  four  contextual  training  trials.  They  con¬ 
cluded  that  when  decoding  is  difficult,  as  it  is  for  irregular  words, 
context  helps  children  read  words  accurately.  However,  like  Cun¬ 
ningham  (2006),  Wang  et  al.  used  a  small  learning  set  (four  words 
per  condition).  Therefore,  it  is  unclear  whether  their  findings 
would  generalize  to  reading  longer  passages. 

Reviewing  the  work  of  Martin-Chang  and  Levy  (2005)  can  help 
resolve  some  of  these  questions.  They  presented  average  readers  in 
Grade  2  (Experiment  2)  with  85  real  words  read  in  a  meaningful 
story  and  85  different  words  read  in  isolation.  The  authors  found 
that  the  children  read  the  target  words  more  accurately  in  context 
compared  with  in  isolation  during  training.  They  also  found  that 
children  read  new  stories  faster  and  more  accurately  if  the  words 
in  the  stories  had  first  been  trained  in  a  different  context.  However, 
Martin-Chang  and  Levy  departed  from  the  methodology  of  other 
researchers  by  electing  to  give  corrective  feedback  in  response  to 
children’s  errors;  therefore,  their  results  cannot  be  directly  com¬ 
pared  with  the  self-teaching  literature  (e.g.,  Cunningham,  2006; 
Landi  et  al.,  2006;  Wang  et  al.,  2011). 

Self-Teaching  and  Feedback 

A  defining  characteristic  of  the  self-teaching  model  is  that 
feedback  from  external  sources  is  not  required  for  word  acquisition.
However,  very  little  research  has  compared  self-teaching
in  the  absence  of  an  expert  with  learning  that  is  assisted  by
feedback.

When  the  child  receives  whole  word  feedback,  the  “whole 
word”  is  supplied  after  an  error.  This  type  of  feedback  is  often 
negatively  contrasted  with  graphophonemic  feedback,  which  re¬ 
lates  individual  letters  or  letter  clusters  with  specific  sounds. 
Whole  word  feedback  (also  termed  “terminal  feedback”)  has
been  faulted  for  ending  the  children’s  decoding  attempts  (Evans,
Barraball,  &  Eberle,  1998).  Phonological  recoding  has  been  speculated
to  be  a  crucial  step  in  learning  how  to  read  words  fluently
(Share,  2004);  consequently,  if  receiving  assistance  discourages
children  from  actively  recoding,  or  ends  the  recoding  process 
prematurely,  it  might  also  impair  the  quality  of  word  representa¬ 
tions  in  memory.  Indeed,  Landi  and  colleagues  (Landi,  2013; 
Landi  et  al.,  2006)  have  speculated  that  students  will  be  less  likely 
to  spend  cognitive  resources  during  initial  decoding  attempts  when 
they  know  the  correct  pronunciation  is  forthcoming.  If  this  is  the 
case,  feedback  would  be  expected  to  reduce  exhaustive  grapheme- 
phoneme  recoding,  to  the  detriment  of  self-teaching. 

Without  question,  a  teacher  or  parent  who  “supplies”  a  misread 
word  is  doing  the  important  work  of  recoding  the  symbols  into 
speech  on  the  child’s  behalf;  yet  it  is  possible  that  hearing  the 
pronunciation  while  attending  to  the  print  might  put  the  child  in  a 
better  position  to  phonologically  recode  the  word  on  subsequent 
trials.  As  argued  by  Ehri  (2014),  when  “readers  see  a  new  word 
and  say  or  hear  its  pronunciation,  its  spelling  becomes  mapped 
onto  its  pronunciation  and  meaning”  (Ehri,  2014,  p.  6).  Barbetta, 
Heward,  Bradley,  and  Miller  (1994)  provided  evidence  to  support 
this  view  by  teaching  five  students  in  Grade  2  to  read  words  in 
isolation  with  immediate  or  delayed  whole-word  feedback.  The 
results  showed  that  immediate  feedback  was  more  profitable  than 
delayed  feedback  in  the  acquisition  and  maintenance  of  word 
reading.  The  authors  speculated  that  the  immediate  feedback  re¬ 
duced  the  likelihood  that  the  same  mistakes  were  repeated 
throughout  training.  Additionally,  immediate  feedback  allowed  the 
pronunciation  to  be  heard  soon  after  the  print  was  seen,  which  may 
have  contributed  to  the  amalgamation  of  the  words’  phonological 
and  orthographic  forms. 

Current  Investigation 

The  literature  discussing  self-teaching  stresses  the  importance  of 
each  “successful  recoding  experience”  as  if  it  were  one  unified
entity  (Cunningham,  2006;  Share,  1999,  2004).  However,  within 
every  fruitful  decoding  attempt  there  are  two  factors  that  could  be 
contributing  to  self-teaching:  (a)  grapheme-phoneme  recoding; 
and  (b)  the  pairing  of  whole  word  orthography  and  phonology. 
This  pairing  of  complete  spoken  words  with  their  respective  letter 
strings  could  be  achieved  in  at  least  three  ways:  by  pure  bottom-up 
decoding  (the  first  factor  in  self-teaching);  by  decoding  that  is 
supplemented  with  top-down  support  from  context;  or  by  feedback 
that  is  provided  by  an  external  source  after  a  failed  decoding 
attempt.  The  question  is,  does  providing  support  (from  context  or 
feedback)  weaken  or  strengthen  long-term  word  recognition?  If 
independent  grapheme-phoneme  recoding  is  critical  to  self¬ 
teaching,  then  situations  where  there  is  very  little  support  (e.g., 
reading  in  isolation  without  feedback)  would  be  expected  to  pro¬ 
duce  the  highest  degree  of  accuracy  over  training.  In  contrast,  if 
pairing  whole-word  orthography  and  phonology  is  central  to  cre¬ 
ating  word  representations  in  memory  (Ehri,  2014),  then  situations 
that  offer  the  most  support  for  reading  accuracy  (e.g.,  reading  in 
context  with  feedback)  should  result  in  superior  accuracy.  The 
current  investigation  tested  these  hypotheses  by  having  students 
read  a  large  set  of  words,  in  and  out  of  meaningful  text,  with  and 
without  feedback. 


Method 

Design 

A  within-subjects  design  was  implemented  with  two  experimen¬ 
tal  manipulations:  the  availability  of  context  and  the  provision  of 
feedback.  Specifically,  the  first  manipulation  involved  whether  the 
target  words  were  presented  in  isolation  or  in  context.  The  second 
manipulation  involved  whether  whole-word  feedback  was  pro¬ 
vided  or  whether  no  feedback  was  given.  Taken  together,  the 
training  phase  of  the  study  consisted  of  four  distinct  experimental 
reading  conditions:  context/feedback,  isolation/feedback,  con¬ 
text/no  feedback,  and  isolation/no  feedback.  Participants  were  ex¬ 
posed  to  four  unique  sets  of  target  words,  which  were  counterbal¬ 
anced  across  all  four  experimental  conditions  (see  Figure  1).  The 
experiment  was  conducted  in  two  18-day  blocks.  In  one  block,  the 
children  were  given  feedback  during  both  the  context  and  isolated 
word  training  conditions.  During  the  other  block,  the  children  did 
not  receive  feedback  during  either  context  or  isolated  word  train¬ 
ing.  Each  block  contained  five  training  sessions  (on  Days  1,  3,  7, 
9,  and  11).  A  posttest  was  administered  on  the  last  day  of  each 
block  (Day  18)  to  determine  if  accuracy  gains  would  be  maintained
over  a  delay.

Figure  1.  Experimental  training  design.  Story  =  context;  list  =
isolation.  *  The  order  of  the  feedback  conditions  and  the  training  conditions
was  counterbalanced  over  all  participants,  so  that  half  of  the  children
received  the  no-feedback  training  condition  first,  and  the  other  half
received  the  story  condition  first.  In  addition,  the  training  words  were
counterbalanced  over  all  conditions  so  that  each  set  of  words  was  trained
in  each  of  the  four  conditions.  **  Each  list  and  story  contained  two
repetitions  of  85  different  training  words.  Therefore,  the  children  read  a
total  of  170  unique  words  (half  in  a  story,  half  in  a  list),  twice,  each  day.

Participants

Twenty-eight  participants  were  recruited  from  three  Grade  2
classrooms  in  central  Canada.  The  children’s  teachers  all  reported
using  strong  phonics  training  (e.g.,  Jolly  Phonics)  in  their  regular
classroom  instruction.  Children  with  parental  consent  were 
screened  with  the  reading  subtest  of  the  Wide  Range  Achievement 
Test — Third  Edition  (WRAT3;  Wilkinson,  1993).  In  total,  three 
participants  were  omitted  from  the  study.  The  first  student  had  a 
low  standardized  WRAT  score  (<70),  the  second  was  absent  for 
an  extended  period  of  time,  and  the  third  was  shy  and  preferred  not 
to  read  aloud.  The  final  sample  consisted  of  25  students  (17  girls 
and  8  boys).  Participants  were  in  Grade  2  and  were,  on  average,  7 
and  a  half  years  old  (M  =  7  years  and  7  months,  range  =  6  years  and
7  months  to  8  years  and  3  months).  The  average  standardized
WRAT  score  was  97  (SD  =  14,  range  =  75  to  130).  After  each
session,  the  children  were  thanked  with  small  gifts  (e.g.,  stickers, 
pencils).  They  were  also  invited  to  choose  an  age-appropriate  book
to  take  home  after  each  18-day  block. 

Materials 

To  use  longer  and  more  complex  passages  for  the  children,  four 
nonoverlapping  sets  of  85  real  words  were  counterbalanced  over 
the  four  training  conditions  (340  unique  words  in  total/4  condi¬ 
tions  =  85  words  per  condition).  The  words  were  selected  because 
they  were  contained  in  7-to-8-year-olds’  spoken  vocabularies 
(Martin-Chang  &  Levy,  2005).  For  the  present  study,  six  Grade  2 
teachers  (including  a  classroom  teacher  from  the  current  sample) 
also  vetted  the  materials  to  confirm  that  they  contained  words 
children  could  understand  orally.  The  teachers  indicated  that  on 
average,  only  3.6%  of  the  340  words  might  be  difficult  for  children 
in  Grade  2  to  understand;  the  remaining  96.4%  of  the  words  were 
deemed  to  be  common  in  children’s  spoken  vocabularies. 

The  materials  included  words  with  both  regular  and  irregular 
spellings  and  words  with  a  range  of  morphological  complexities 
(see  Appendix  A).  The  lists  were  not  equated  in  terms  of  spelling 
patterns  or  orthographic  complexity;  however,  the  average  number 
of  letters  (average  =  5.7  letters,  range  5.5  letters  to  5.9  letters)  and 
morphemes  (average  =  1.44  morphemes,  range  1.4-1.5)  per  list
were  controlled  for.  The  items  were  not  screened  to  avoid  addi¬ 
tional  word  exposures  (see  also  Cunningham,  2006);  therefore,  it 
was  difficult  to  track  how  many  of  the  words  could  already  be  read 
before  the  first  session. 

The  isolation  reading  condition  was  achieved  by  presenting  the 
words  individually  on  a  computer  screen.  The  target  items  were 
repeated  twice  within  each  isolated-word  list  to  produce  a  total  of 
170  target  words.  To  create  the  context  condition,  four  narratives 
were  created  around  each  of  the  lists  (see  Appendix  B  for  an 
example).  The  target  words  were  repeated  twice  within  each  train¬ 
ing  story  to  produce  a  total  of  170  target  words.  The  four  stories 
contained  some  written  words  (approximately  30-40%)  that  were 
above  the  children’s  reading  ability  and  were  all  evaluated  to  be  a 
4.0 grade level by the Flesch-Kincaid readability test. They were
written to resemble children’s passages and ranged from 675 to
766 words in length. As would be the case within authentic texts,
the amount of contextual support differed across
words.

Procedure 

Two  training  conditions  were  conducted  during  each  18-day 
block  (see  Figure  1).  It  was  speculated  that  receiving  feedback  on 


only some trials might create unnecessary confusion for the children
(e.g., they might wait for feedback on trials where no feedback
was forthcoming). Therefore, the feedback manipulation was
administered  in  separate  blocks.  The  order  of  the  blocks  was 
counterbalanced  so  that  half  of  the  children  received  feedback  in 
both  conditions  (context  and  isolation)  during  the  first  block  and 
the  other  half  of  the  children  received  feedback  in  both  conditions 
(context  and  isolation)  during  the  last  block.  Within  each  block, 
the  order  of  the  conditions  was  counterbalanced  so  that  context 
was  presented  first  for  half  of  the  children  and  presented  last  for 
the  rest.  It  is  important  to  note  that,  while  the  context  and  isolation 
manipulation  occurred  on  the  same  days  (either  in  the  feedback  or 
no feedback blocks), each of the four reading conditions corresponded
to different word sets. The delayed posttest occurred on
Day  18  of  each  block. 
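
The counterbalancing described above can be sketched in code. The article states only that block order, condition order within a block, and the four word sets were counterbalanced; the specific rotation below (a Latin-square rotation of lists over conditions, indexed by a hypothetical `child_index`) is an assumption made for illustration, not the authors' documented scheme.

```python
CONDITIONS = [
    ("context", "feedback"),
    ("isolation", "feedback"),
    ("context", "no feedback"),
    ("isolation", "no feedback"),
]
WORD_SETS = ["List 1", "List 2", "List 3", "List 4"]


def assignment(child_index):
    """Return a hypothetical schedule for one child (illustrative only)."""
    feedback_first = child_index % 2 == 0        # half start with the feedback block
    context_first = (child_index // 2) % 2 == 0  # half read in context first
    rotation = (child_index // 4) % 4            # which list meets which condition
    # Latin-square rotation: every list is paired with every condition equally often.
    sets = {
        cond: WORD_SETS[(i + rotation) % 4] for i, cond in enumerate(CONDITIONS)
    }
    blocks = ["feedback", "no feedback"] if feedback_first else ["no feedback", "feedback"]
    within = ["context", "isolation"] if context_first else ["isolation", "context"]
    return {"blocks": blocks, "within_block": within, "sets": sets}


if __name__ == "__main__":
    for child in range(4):
        print(child, assignment(child))
```

With any multiple of 16 children, this rotation balances every factor fully; with other sample sizes the balance is only approximate.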

Two  trained  research  assistants  worked  with  the  children  in  a 
quiet  place  in  their  school.  The  first  experimenter  conducted  the 
training  phase,  while  a  second  experimenter,  who  was  blind  to  the 
feedback  condition  used  during  training,  conducted  the  posttest. 

Training phase. The students were given praise and encouragement
throughout the entire experiment (not contingent on correct
reading). However, during the feedback block, the children
were  also  given  whole-word  corrective  feedback  after  reading 
errors.  In  the  no  feedback  block,  the  participants  were  asked  to 
read  as  if  they  were  alone;  here,  they  were  not  given  assistance  of 
any  kind.  Within  each  block  (feedback/no  feedback),  the  children 
read  two  material  sets;  each  set  included  85  different  words.  The 
children  read  one  story  (context)  and  one  list  (isolation)  during 
each  training  session.  The  stories  and  lists  contained  two  repeti¬ 
tions  of  each  word. 

During  isolated-word  training,  individual  words  were  presented 
on  a  computer  screen  for  2  s,  followed  by  a  fixation  point.  The  rapid 
pace  of  word  presentation  kept  the  task  from  becoming  overly  long 
and/or  monotonous  and  the  frequent  encouragement  and/or  feed¬ 
back  from  the  experimenter  kept  the  task  from  feeling  completely 
solitary.  To  ensure  that  the  lists  did  not  appear  too  “story  like”  (i.e., 
not rapid serial visual presentation), the words were shown in a
fixed-randomized order, with the only stipulation being that no
word was presented twice in a row (85 words × 2 repetitions =
170 words). If the children were in the feedback block, the experimenter
provided whole-word feedback after errors. The children
were  not  asked  to  repeat  the  corrected  word.  Pauses  longer  than  2  s 
were  considered  omissions.  Inaccurate  attempts  and  omissions 
were  both  marked  as  errors.  If  the  child  was  in  the  no  feedback 
block,  training  continued  without  interruption.  The  experimenter 
discreetly  coded  the  answer  as  correct  or  incorrect  during  the 
fixation  cross,  and  prompted  the  onset  of  the  next  word  by  pressing 
a  computer  key.  The  sessions  were  audio  recorded  so  that  the 
scoring  sheets  could  be  verified  for  accuracy. 
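
One way to build a fixed-randomized list that meets the single stipulation (no word twice in a row) is rejection sampling with a fixed seed. This is a minimal sketch; the function name, the seed, and the placeholder word labels are assumptions, and the article does not say how its order was generated.

```python
import random


def fixed_randomized_order(words, repetitions=2, seed=0):
    """Shuffle `repetitions` copies of each word until no word repeats back to back.

    A fixed seed makes the order reproducible, so every child can see the same list.
    """
    if len(set(words)) < 2 and repetitions > 1:
        raise ValueError("need at least two distinct words to avoid adjacent repeats")
    rng = random.Random(seed)
    trials = list(words) * repetitions
    while True:
        rng.shuffle(trials)
        # Accept the shuffle only if no neighbouring pair is the same word.
        if all(a != b for a, b in zip(trials, trials[1:])):
            return trials


if __name__ == "__main__":
    word_list = [f"word{i:02d}" for i in range(85)]  # placeholder labels, not the real stimuli
    order = fixed_randomized_order(word_list)
    print(len(order))  # 85 words x 2 repetitions
```

For 85 words shown twice, a sizeable share of random shuffles already satisfies the constraint, so the loop normally ends after a few attempts.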

During  context  training,  a  shared  reading  paradigm  was  adopted 
where  the  participants  read  only  the  target  words  (85  X  2  repeti¬ 
tions  =  170  words  per  session).  The  shared  reading  paradigm 
equated  the  task  demands  of  the  training  conditions  (isolation  and 
context).  The  children  were  asked  to  follow  along  with  the  story 
and  read  the  words  that  were  bolded  and  underlined  while  the 
experimenter  read  the  remainder  of  the  story.  This  style  is  similar 
to  one  that  might  be  adopted  by  a  parent  who  pauses  to  let  the 
children  read  some  of  the  words  in  a  “daily  reading”  story  at  home. 
The  experimenter  read  at  a  natural  pace,  pausing  at  each  target 
word. If the child was in the feedback condition, the experimenter
would  read  the  target  word  after  the  child  made  an  error  or  failed 
to  attempt  the  word  after  2  s.  The  story  continued  without  the  child 
repeating  the  corrected  word.  In  the  no-feedback  condition,  the 
experimenter  would  resume  with  the  story  after  errors.  Once  again, 
the sessions were audiotaped for scoring purposes. The experimenter
discreetly coded the child’s reading as correct or incorrect
as the child read; inaccurate attempts and failures to
respond were both marked as errors.

Testing  phase.  The  delayed  posttest  occurred  on  the  last  day 
of the 18-day testing block. During the posttest, the children were
asked  to  read  the  same  materials  that  they  had  used  during  their 
training  phase  (i.e.,  same  contextual  story  and  same  isolation  word 
list)  to  determine  if  the  reading  accuracy  gains  from  the  training 
sessions  had  been  retained.  The  only  difference  between  the  train¬ 
ing  phase  and  the  posttest  was  that  the  children  were  not  offered 
assistance,  regardless  of  whether  they  had  been  in  the  feedback 
condition  during  training.  If  the  child  paused  for  more  than  2  s,  the 
experimenter  reassured  the  child  that  it  was  fine  to  “skip  it”  or 
“keep  going.” 

Results 

Word  Reading  in  Session  1 

As  shown  in  Figure  2,  the  children  were  able  to  read  many  of  the 
target  words  at  the  onset  of  training.  Accuracy  ranged  from  61  to 
71%  during  the  first  session.  To  determine  if  there  were  any  initial 
differences between the conditions, a 2 (context, isolation) × 2
(feedback, no feedback) repeated measures analysis of variance
(ANOVA) was conducted on the accuracy scores from the first
training session. The significant main effect of context confirmed
that children were able to read more words correctly in context
compared with in isolation at the beginning of training, F(1, 24) =
39.73, MSE = 4,121.64, p < .001. The difference between word
reading in context versus in isolation corresponded to a very large
effect (r = .79). There was no main effect of feedback, indicating
the accuracy scores in Session 1 were similar regardless of whether
the children were in the feedback or no-feedback condition (p =
.15). The Context × Feedback interaction was not significant (p =
.65). Similar accuracy scores were expected in the two feedback
conditions because the supplementary instruction had not had time
to exert influence on training.

Figure 2. Percentage of words read correctly during the five training
sessions and posttest.

Word  Reading  in  Sessions  1-5 

The  number  of  words  that  could  be  read  inclusively  from 
Sessions  1-5  was  also  analyzed.  Figure  2  depicts  the  percentage  of 
words  read  correctly  during  each  session.  The  highest  accuracy 
scores  from  Session  2  onward  were  noted  when  children  read  in 
context/feedback.  The  lowest  scores  were  observed  when  they  read 
in  isolation/no  feedback.  Figure  2  also  shows  that  when  children 
read  in  isolation/feedback,  they  started  training  with  reduced  ac¬ 
curacy,  but  by  Session  3,  the  children’s  scores  in  the  isolation/ 
feedback  condition  had  surpassed  the  context/no-feedback  condi¬ 
tion. 

To  evaluate  whether  participants’  word  reading  accuracy  im¬ 
proved  across  training  sessions,  and,  if  so,  whether  this  change  was 
modulated by the experimental reading conditions, a 2 (context,
isolation) × 2 (feedback, no feedback) × 5 (Sessions: 1-5) repeated
measures ANOVA was conducted. Inspection of Mauchly’s
test indicated that the assumption of sphericity had been violated
for the main effect of Session, χ²(9) = 103.25, p < .001, and the
interaction between Feedback and Session, χ²(9) = 65.31, p <
.001. The degrees of freedom associated with these effects were
corrected using Greenhouse-Geisser conservative estimates of
sphericity (ε = .31 for the main effect of Session and ε = .40 for
the Feedback × Session interaction; Field, 2009). Results from the
ANOVA confirmed significant main effects of context
(F(1, 24) = 35.51, MSE = 21,496.58, p < .001, r = .77) and
feedback (F(1, 24) = 39.43, MSE = 37,243.48, p < .001, r = .79),
both of which corresponded to large effect sizes (Cohen, 1988).
However, the Context × Feedback interaction was not significant
(p = .236). Furthermore, the significant main effect of session
showed that children were able to read a greater number of words
as they participated in additional practice sessions, F(1.26,
30.33) = 65.24, MSE = 31,879.71, p < .001. Planned pairwise
comparisons, with Bonferroni adjustments for multiple comparisons,
confirmed that each session was associated with significantly
greater learning in word reading compared with the performance in
the previous session (all pairwise ps < .003). More importantly,
this main effect was qualified by a significant Feedback × Session
interaction, F(1.58, 37.89) = 36.34, MSE = 5,813.50, p < .001,
indicating that feedback resulted in accelerated learning over sessions.
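
The fractional degrees of freedom above follow from the Greenhouse-Geisser correction, which multiplies both the effect and error degrees of freedom by the sphericity estimate ε. The sketch below assumes 25 children (inferred from the error df of 24 in the F(1, 24) tests) and five sessions, giving uncorrected df of 4 and 96; the function name is hypothetical.

```python
def greenhouse_geisser_df(epsilon, df_effect, df_error):
    """Scale both degrees of freedom by the sphericity estimate epsilon."""
    return epsilon * df_effect, epsilon * df_error


N_CHILDREN = 25   # assumption: inferred from the error df of 24 in F(1, 24)
SESSIONS = 5
DF_EFFECT = SESSIONS - 1                      # 4
DF_ERROR = (SESSIONS - 1) * (N_CHILDREN - 1)  # 96

if __name__ == "__main__":
    # Corrected df using the rounded epsilons quoted in the text.
    print(greenhouse_geisser_df(0.31, DF_EFFECT, DF_ERROR))
    print(greenhouse_geisser_df(0.40, DF_EFFECT, DF_ERROR))
    # Epsilon implied by the reported error df; close to the quoted values.
    print(30.33 / DF_ERROR, 37.89 / DF_ERROR)
```

The rounded ε values reproduce the reported degrees of freedom only approximately, which is expected when ε is quoted to two decimals.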

Planned comparisons using the “repeated” contrast function were
conducted to examine the interaction between Feedback and Session
more  carefully.  Repeated  contrasts  are  especially  useful  in  a 
repeated-measures  design  in  which  the  level  of  a  variable  has  a 
meaningful  order  (e.g.,  Session  1,  2,  3,  etc.;  Field,  2009).  In  terms 
of the current study, the repeated contrasts compared the performance
at each session with the performance at the previous session
(e.g., Session 3 vs. 2), and evaluated the degree of learning in relation
to the effects of feedback. At nearly every successive practice
session, the Feedback condition was associated with significantly
greater learning in word reading accuracy compared with the No
Feedback condition (all ps < .004; Session 1 vs. 2, r = .76;
Session 2 vs. 3, r = .65; Session 3 vs. 4, r = .56). Although there
was a significant improvement in reading accuracy between Sessions
4 and 5 overall, the improvement was statistically similar for
the Feedback and No Feedback conditions (Session 4 vs. 5, p = .06).
The Context × Session (p = .50) and the Context × Feedback ×
Session (p = .19) interactions were not significant, indicating that
the initial benefit of reading in context was maintained over all the
training sessions.1

The  contextual  advantages  observed  throughout  training  raised 
the question of whether the effects were being driven by a small
number  of  highly  irregular  items  within  the  larger  word  set.  To 
gain  insight  into  this  question,  the  data  were  reorganized  to  be 
examined from a within-word perspective. Simply put, this preliminary
analysis shed light on how many times the same word
was  read  more  accurately  (by  different  children)  when  it  was 
presented  in  context  compared  with  when  it  was  presented  in 
isolation. Reorganizing the data in this manner resulted in a less-than-ideal
data set that was very large (85 words × 4 lists × 2
forms of feedback = 680 items) with very few observed values per
item (approximately six participants per condition). Nevertheless,
this exploratory inspection of the data revealed that when read with
feedback, 215/340 words were read more accurately in context,
106/340 words were read more accurately in isolation, and 18/340
words were read equally accurately in context and isolation. A
Wilcoxon signed-ranks test found this difference to be significant
(z = 7.324, p < .001). An uncannily similar
pattern was observed when the words were read without feedback:
213/340 words were read more accurately in context, 106/340
words were read more accurately in isolation, and 20/340 words
were tied. Once again, a Wilcoxon signed-ranks test found the
difference to be significant (z = 8.202, p < .001).
Although  the  original  study  was  not  designed  with  these  analyses 
in  mind,  the  descriptive  results  suggest  that  the  context  effects  are 
not  being  driven  by  a  small  number  of  words. 
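
The direction of these within-word counts can be checked with a simpler procedure than the one reported. The sketch below runs a sign test on the win/loss counts alone; it is not the Wilcoxon signed-ranks test used in the article, which also needs the size of each word's accuracy difference, so its z values are not expected to match the reported ones. The function name is hypothetical.

```python
from math import comb, sqrt


def sign_test(n_context_better, n_isolation_better):
    """Two-sided exact sign test on win/loss counts, with a normal-approximation z.

    Tied words are excluded, as is standard for a sign test.
    """
    n = n_context_better + n_isolation_better
    k = max(n_context_better, n_isolation_better)
    # Exact binomial upper tail under the null of equal win probability.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_two_sided = min(1.0, 2 * upper_tail)
    z = (k - n / 2) / sqrt(n / 4)
    return z, p_two_sided


if __name__ == "__main__":
    print(sign_test(215, 106))  # words read with feedback
    print(sign_test(213, 106))  # words read without feedback
```

Even under this cruder test, the advantage for context is far too consistent across words to reflect a handful of items.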

Delayed  Posttest 

On  the  18th  day  of  each  training  block,  the  children  were  once 
again  presented  with  the  training  materials  to  measure  retention  of 
learning.  This  follow-up  task  was  administered  to  determine  if  the 
reading  gains  made  during  training  would  be  maintained  over  a 
6-day  period.  As  depicted  in  Figure  2,  the  general  patterns  favoring 
context and feedback were also observed during the posttest (context/feedback
= 93.98%, isolation/feedback = 87.93%, context/no
feedback = 81.45%, isolation/no feedback = 73.89%). Furthermore,
the  accuracy  scores  improved  over  the  retention  period,  indicating 
that  the  children  treated  the  delayed  posttest  like  an  additional 
training session. A 2 (context, isolation) × 2 (feedback, no feedback)
× 2 (Session 5, posttest) repeated measures ANOVA confirmed
that all three main effects were significant and corresponded
to large effect sizes (Cohen, 1988): context (F(1, 24) =
24.38, MSE = 7,267.92, p < .001, r = .71), feedback (F(1, 24) =
48.51, MSE = 30,582.28, p < .001, r = .82), and time of test (F(1,
24) = 7.69, MSE = 1,793.13, p = .011, r = .49). The absence of
any  significant  interactions  (all  ps  >  .29)  suggested  that  the  effects 
of  context  and  feedback  were  similar  at  both  the  end  of  training 
and  during  the  retention  task. 
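
The effect sizes reported alongside the single-df F tests are consistent with the conversion r = sqrt(F / (F + df_error)). The article does not state the formula, so its use here is an assumption; the sketch recomputes r from each F value quoted in the Results.

```python
from math import sqrt


def effect_size_r(f_value, df_error):
    """Convert a one-numerator-df F statistic to the effect size r."""
    return sqrt(f_value / (f_value + df_error))


# (label, F, error df) as quoted in the Results.
REPORTED = [
    ("Session 1 context", 39.73, 24),
    ("Sessions 1-5 context", 35.51, 24),
    ("Sessions 1-5 feedback", 39.43, 24),
    ("Posttest context", 24.38, 24),
    ("Posttest feedback", 48.51, 24),
    ("Posttest time of test", 7.69, 24),
]

if __name__ == "__main__":
    for label, f_value, df_error in REPORTED:
        print(f"{label}: r = {effect_size_r(f_value, df_error):.2f}")
```

The conversion applies only to effects with one numerator degree of freedom, which is why the multi-df Session effects carry no r value.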

Discussion 

Successful  decoding  experiences  contain  two  closely  linked 
components,  each  of  which  could  be  contributing  to  the  formation 


of  orthographic  representations  in  memory.  The  first  factor  in¬ 
volves  the  act  of  decoding,  which  can  only  function  if  the  reader 
is  attending  to  the  orthographic  details  of  the  word.  The  second 
factor  involves  the  result  of  decoding,  which  happens  when  the 
reader  accesses  the  word’s  spoken  pronunciation.  Thus  far,  the 
lion’s  share  of  research  on  self-teaching  has  been  dedicated  to 
exploring  the  decoding  process  (the  first  factor),  with  fewer  re¬ 
sources  directed  at  examining  the  importance  of  matching  whole- 
word  phonology  and  orthography  (the  second  factor).  Granted, 
when  decoding  activates  the  correct  spoken  word,  these  two  fac¬ 
tors  become  tightly  woven  and  any  additional  contribution  beyond 
decoding  is  difficult  to  measure.  However,  when  decoding  is 
ineffective — as  it  was  for  30  to  40%  of  the  words  read  in  Session 
1 — it  creates  an  opportunity  to  experimentally  bolster  the  second 
factor  and  measure  resulting  changes  in  word  learning.  In  the 
current  experiment,  self-teaching  was  tracked  while  children  read 
under  two  conditions — namely,  context  and  feedback — hypothe¬ 
sized  to  increase  the  availability  of  word  pronunciations.  If  the 
availability of whole-word phonology contributes to the creation
of word-specific representations in memory, then higher learning
rates  would  be  expected  in  these  more  supportive  conditions. 

When  contemplating  the  two  feedback  conditions  it  becomes 
apparent  that  reading  was  more  accurate  when  words  were  pre¬ 
sented  in  context  compared  with  in  isolation.  From  the  second 
session  onward,  the  highest  accuracy  scores  were  observed  when 
children  read  with  the  benefit  of  both  context  and  feedback;  this 
pattern  remained  unchanged  over  a  6-day  delay.  This  may  suggest 
that  generating  and/or  saying  the  pronunciation  aids  children  more 
than  just  hearing  the  correct  pronunciation  of  the  word  after  an 
error.  Such  an  interpretation  fits  nicely  with  the  work  of  Ehri,  who 
has  found  accelerated  learning  rates  when  children  were  asked  to 
read  words  out  loud  compared  with  silently  (see  Ehri,  2014  for 
review).  Alternatively,  it  could  suggest  that  children  are  simply 
more  successful  when  reading  in  context,  and  that  this  advantage 
is  maintained  even  with  the  additional  support  of  feedback.  The 
present experiment is not able to tease apart these alternative
hypotheses; however, in either case, the data clearly show that
training  is  more  effective  when  children  have  access  to  the  words’ 
pronunciations,  and  that  effortful  grapheme-by-phoneme  decoding 
is  not  the  only  process  for  creating  word  representations  in  mem¬ 
ory. 

Context  also  helped  children  read  more  accurately  when  they 
read  without  feedback.  During  the  first  no-feedback  session,  chil¬ 
dren  read  12.36  more  words  in  context  than  they  did  in  isolation. 
This  contextual  benefit  was  largely  maintained  throughout  the 
duration  of  training.  Indeed,  it  took  children  five  sessions  reading 
in isolation without feedback (116.43 words) to achieve the same
degree  of  accuracy  as  they  had  on  the  first  day  when  reading  in 
context  without  feedback  (116.64  words).  The  benefits  observed 
from  context  also  remained  stable  over  a  6-day  delay.  Framing 
these  findings  in  terms  of  the  two  factors  hypothesized  to  impact 
self-teaching  helps  to  explain  how  the  top-down  support  from 
context  aids  in  long-term  word  learning.  The  contextual  facilitation 
effect describes how children use partial word decodings and
contextual constraints to improve their immediate word reading
performance. The byproduct of increased reading accuracy is
greater access to whole-word phonology and, thereby, more opportunities
to amalgamate speech and print, and hence, greater learning
(Ehri, 2014; Share, 2004).

1 The same pattern of results was found when only words that were read
incorrectly in Session 1 were included in the analysis in a 2 (context,
isolation) × 2 (feedback, no feedback) × 4 (Sessions 2-5) ANOVA.

Thus  far,  it  seems  that  reading  with  both  forms  of  support 
(context  and  feedback)  offers  greater  benefits  than  reading  without 
either  of  these  scaffolds  (isolation  without  feedback).  However, 
what  if  only  one  form  of  support  was  available?  Do  children  gain 
more  from  reading  in  context  (without  feedback)  or  by  reading 
with  feedback  (out  of  context)?  This  is  an  interesting  question 
from  a  theoretical  standpoint  because  of  the  fact  that  feedback 
pairs  a  word’s  verbal  pronunciation  with  its  written  form  on  every 
repetition  (either  the  child  reads  the  word  correctly,  or  the  pronun¬ 
ciation  is  provided).  In  contrast,  context  allows  a  greater  propor¬ 
tion  of  words  to  be  read  correctly  initially,  but  children  will  not 
generate/hear  the  correct  pronunciations  of  all  of  the  words  on 
every  trial.  Therefore,  if  seeing  and  hearing  words  in  close  suc¬ 
cession  is  a  driving  force  behind  gains  in  accuracy  across  sessions, 
then receiving feedback should produce greater benefits than reading
in context. This prediction was borne out in the current study.
Word  reading  in  the  isolation/feedback  condition  improved  at  a 
faster  rate,  and  ultimately  surpassed  the  context/no  feedback  read¬ 
ing  scores. 

The present study elected to use whole-word feedback, which is
generally regarded as less helpful than graphophonemic feedback
during parent-child reading activities (Evans et al., 1998). The fact
that marked reading improvements were found even after the least
effective form of guidance was given makes a convincing case for
the  use  of  feedback.  It  should  be  noted,  however,  that  the  children 
in  the  current  study  were  in  Grade  2  and  they  already  had  a  basic 
foundation  of  decoding  skills.  Therefore,  these  results  might  not  be 
replicated  with  a  younger  group  of  children. 

Conjectures  have  been  raised  that  independently  generating  pho¬ 
nology  from  print  may  provide  ideal  circumstances  for  long-term 
learning.  A  strong  interpretation  of  this  statement  suggests  that 
corrective  feedback  may,  in  fact,  be  disadvantageous  to  reading 
development  (Landi,  2013).  The  data  reported  here  do  not  support 
this  hypothesis.  Children  experienced  the  most  difficulty  when 
they  were  reading  in  isolation  without  feedback.  However,  it  is 
worth  highlighting  that  even  under  the  least  supportive  conditions, 
children  were  able  to  learn  more  than  12  words  over  and  above  the 
104  words  that  were  read  correctly  during  the  first  session  of 
isolated-word  reading  without  feedback.  Therefore,  the  current 
study  adds  to  the  literature  (cf.,  Share,  1995,  1999,  2004)  by 
showing  that  children  are  capable  of  independently  using  recoding 
skills to self-teach with the help of nothing more than pure decoding.
However, they learned more than three times as many words by the
end of training when they were given whole-word feedback in
isolation  compared  with  when  they  were  left  to  read  in  isolation 
alone.  These  data  suggest  that  hearing  the  spoken  word  soon  after 
an  unsuccessful  reading  attempt  may  help  children  form  partial 
orthographic  representations  in  memory  that  can  be  referenced  and 
refined  on  future  word  encounters. 
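
The word counts quoted in the Discussion can be reconciled with the percentages in the Results by treating each session as 170 trials (85 words × 2 repetitions), as described in the Method. The sketch below is only an arithmetic check on the no-feedback figures; the variable names are illustrative.

```python
TRIALS_PER_SESSION = 85 * 2  # 85 words, each read twice per session


def percent_correct(words_correct, trials=TRIALS_PER_SESSION):
    """Express a mean count of correctly read trials as a percentage."""
    return 100 * words_correct / trials


context_session1 = 116.64                      # context, no feedback, Session 1
isolation_session1 = context_session1 - 12.36  # 12.36 fewer words in isolation
isolation_session5 = 116.43                    # isolation, no feedback, Session 5
gain_without_support = isolation_session5 - isolation_session1

if __name__ == "__main__":
    print(f"Isolation, Session 1: {isolation_session1:.2f} words "
          f"({percent_correct(isolation_session1):.1f}%)")
    print(f"Context, Session 1: {percent_correct(context_session1):.1f}%")
    print(f"Gain in isolation without feedback: {gain_without_support:.2f} words")
```

The result matches the text: about 104 words correct in the first isolated, no-feedback session, a gain of just over 12 words by Session 5, and first-session accuracy inside the reported 61 to 71% range.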

When  contemplating  the  role  of  feedback  in  this  investigation  it 
is  important  to  note  that  children  were  actively  attempting  to 
decode  the  words  before  hearing  the  word  pronunciations.  The 
same  pattern  of  results  would  not  be  expected  if  the  children  did 
not  yet  understand  the  alphabetic  principle  or  if  they  were  simply 


“following  along”  while  the  teacher  read  the  text.  In  this  sense, 
even  with  feedback,  the  main  components  of  self-teaching  remain 
intact.  That  is,  for  feedback  to  result  in  lasting  improvements,  it  is 
hypothesized  that  the  child  must  still  attempt  to  recode  the  word  in 
relation  to  the  print — in  this  case,  the  pronunciation  provided  by 
the  teacher  may  simply  act  as  a  catalyst  to  jumpstart  future  recod¬ 
ing  opportunities. 

The  findings  reported  here  are  not  in  complete  agreement  with 
other  studies  in  the  self-teaching  literature  (e.g.,  Landi  et  al.,  2006; 
Nation,  Angell,  &  Castles,  2007;  Ricketts,  Bishop,  Pimperton,  & 
Nation,  2011;  Wang  et  al.,  2011,  Experiment  1);  however,  there 
are  several  differences  between  this  study  and  those  reported 
previously  that  merit  consideration.  The  first  involves  whether  the 
dependent  variable  is  the  total  number  of  words  that  can  be  read 
correctly  (online  performance)  or  the  number  of  words  learned. 
During  the  present  experiment,  the  same  children  were  substan¬ 
tially  more  successful  when  presented  with  words  in  context 
compared  with  in  isolation.  However,  because  this  benefit  was 
apparent  on  the  first  trial,  the  rate  of  learning  did  not  differ 
between  the  context  and  the  isolation  conditions.  Therefore,  the 
condition  that  results  in  superior  reading  is  a  matter  of  opinion.  On 
the  one  hand,  it  could  be  argued  that  children  learned  just  as  many 
words  in  isolation  as  they  did  in  context.  On  the  other  hand,  it 
could  also  be  said  that  presenting  words  in  contexts  allows  chil¬ 
dren  to  experience  greater  success  while  reading  from  the  first 
session  onward.  Both  arguments  are  equally  valid  and  supported 
by  the  data. 

Second,  the  role  of  context  is  most  influential  when  reading  is 
difficult.  Therefore,  the  nature  of  the  materials  is  an  important 
factor  with  regards  to  context  effects.  As  noted  in  the  introduction, 
several  studies  in  this  area  have  used  very  small  training  sets  (e.g., 
eight  items)  and  words  that  can  be  easily  decoded  (e.g.,  CVCe 
words  such  as  “yate”).  When  decoding  is  relatively  simple  it  is 
unlikely  that  children  would  need  the  extra  support  provided  by 
context.  It  is  hypothesized  that  using  a  much  larger  word  set — with 
words  that  varied  greatly  in  letter  length  and  morphological  com¬ 
plexity — made  the  present  study  more  sensitive  to  change,  and 
allowed  the  advantages  of  context  to  be  observed. 

A  final  consideration  is  whether  spelling  or  reading  is  the 
variable  of  interest.  The  current  experiment  focused  on  self¬ 
teaching  in  relation  to  word  reading  rather  than  spelling.  However, 
there  could  be  any  number  of  different  mappings  between  the 
underlying  orthographic  representation  and  the  way  a  word  is 
ultimately  read.  The  child  could  read  the  word  correctly  and  have 
a  high  quality  representation;  conversely,  the  child  could  read  the 
word  correctly  and  have  a  low  quality  or  incomplete  orthographic 
representation  (Martin-Chang,  Ouellette,  &  Madden,  2014). 
Therefore,  the  current  findings  cannot  comment  on  whether  the 
benefits  associated  with  contextual  reading  would  generalize  to 
spelling.  In  fact,  there  is  reason  to  believe  that  they  may  not.  For 
example,  Kyte  and  Johnson  (2006)  found  that  reading  accuracy 
during  their  learning  phase  was  not  equally  correlated  with  all  of 
the  tasks  that  comprised  the  orthographic  learning  task.  Rather,  it 
was  most  closely  linked  to  the  reading  accuracy  posttest  measure 
and  least  linked  with  the  orthographic  choice  task.  Given  the 
wealth  of  evidence  showing  that  context  does  not  aid  (nor  hinder) 
orthographic  learning,  it  could  be  possible  that  the  two  factors 
involved  in  self-teaching  (decoding  and  whole-word  phonology) 
have  differential  effects  on  reading  and  spelling  development. 
Grapheme-to-phoneme decoding may have a more profound impact
on spelling development (suggesting higher quality orthographic
representations in memory), while matching whole-word
orthography  to  phonology  might  suffice  for  reading  accuracy  but 
not  precise  spelling  (suggesting  representations  that  may  not  be 
fully  specified).  It  is  also  important  to  consider  that  a  child  could 
conceivably  read  a  word  incorrectly  (e.g.,  read  the  word  “chaos”  as 
“chase”  or  “city”  as  “kitty”)  and  still  have  formed  a  stable  ortho¬ 
graphic  representation  of  that  word  in  memory,  albeit  one  that  is 
associated  with  an  incorrect  pronunciation. 

Limitations  and  Future  Directions 

This  study  offers  a  different  perspective  on  self-teaching  be¬ 
cause  it  observed  how  children  learn  to  read  a  large  bank  of  real 
words  over  multiple  exposures.  In  many  respects  the  materials 
were  representative  of  those  children  encounter  when  they  read 
independently; specifically, they contained many words of varying
difficulty to be learned. However, using real words made it difficult
to determine the proportion of items that were self-taught
during  the  first  session  compared  with  those  that  were  previously 
known.  Therefore,  it  would  be  fruitful  to  replicate  the  procedures 
discussed  here  with  words  that  were  known  verbally,  but  not  in 
writing,  to  determine  if  the  same  patterns  hold.  In  addition,  the 
ability  to  arrive  at  a  correct  pronunciation  without  a  high  quality 
representation  to  support  reading  is  partially  driven  by  the  prop¬ 
erties  of  the  words  themselves.  Therefore,  examining  the  interac¬ 
tion  between  context  and  word  regularity  may  be  insightful.  Future 
studies  should  consider  using  a  cross-classified  analytic  approach 
(see  Kim,  Petscher,  Foorman,  &  Zhou,  2010)  to  examine  the 
contribution  of  both  participant  characteristics  and  word  features 
simultaneously. 

A  second  limitation  involves  the  shared  reading  paradigm  used 
during  this  study.  Shared  reading,  while  offering  a  high  degree  of 
experimental  control,  does  not  mimic  the  independent  reading  that 
generally  happens  when  children  self-teach.  This  manner  of  pre¬ 
sentation  might  have  felt  awkward  for  good  readers,  who  could 
have  read  the  whole  text  alone.  It  may  also  have  maximized  the 
effects  of  context  for  poor  readers,  who  might  have  struggled  to 
read  the  surrounding  text  independently.  Therefore,  this  study 
should  be  replicated  when  children  are  reading  independently. 

Conclusions 

Orthographic learning has been credited with providing readers
with fast, accurate, and long-lasting access to written words; in
short, self-teaching is the mechanism that has been credited with
improving reading fluency. The results from the present experiment
showed that children learned to read a number of words
without  feedback  in  both  context  and  in  isolation,  but  that  im¬ 
provement  during  training  was  more  pronounced  (and  equally 
lasting)  when  feedback  was  provided.  The  data  also  showed  that 
children  could  read  substantially  more  words  accurately  when  they 
were  presented  with  words  in  context  compared  with  in  isolation. 
The  most  advantageous  of  all  the  conditions  was  when  children 
were  given  the  benefits  of  both  context  and  feedback. 

The conclusions drawn from the present study suggest that
generating (with the help of context) or being given a word’s
pronunciation (via feedback) aids in accurate decoding on future
word encounters. It would seem that children use all available
resources, including feedback and context, to amalgamate the
orthographic, phonological, and semantic properties of words as
they are learning to read.
Appendix A

Stimuli  Lists  and  Word  Frequency  Counts  Per  Million  Words 


List 1        Frequency  List 2        Frequency  List 3        Frequency  List 4        Frequency
After         682.59     About         3631.49    Admired       3.67       Away          730.9
Ahead         198.33     Above         48.88      Announced     5.67       Awful         63.41
Animals       39.08      Again         792.71     Arms          59.8       Bald          9.73
Behind        187.86     Always        655.25     Around        736.73     Because       1071.02
Below         28.04      Answered      15.2       Asked         216.25     Belt          24.35
Beneath       11.63      Beating       22.75      Baby          509.37     Boots         19.16
Board         64.16      Before        794.14     Back          2009.16    Boys          224.16
Both          295.33     Began         32.51      Bath          31.12      Branch        10.08
Bounced       2.96       Bird          45.45      Bathroom      61.67      Breeze        8.04
Bridge        45.71      Blackbird     .82        Bathtub       6.1        Broke         105
Cars          45.63      Books         67.76      Bedroom       36.71      Called        340.02
Chuckled      .08        Bright        44.41      Best          404.37     Camera        57
Close         219.43     Child         157.65     Bowls         2.18       Cannon        8.71
Coming        527.02     Chirping      .98        Brand         13.96      Caught        93.94
Could         1629.59    Clever        27.27      Carried       20.12      Check         278.98
Crash         28.65      Confessed     5.96       Change        240.35     Climb         19.75
Cross         55.04      Cried         12.98      Clean         121.24     Couldn’t      N/A
Deer          8.71       Demanded      2.76       Closely       9.18       Crack         32.84
Drove         28.86      Different     209.53     Continue      49.55      Crows         2.76
Elizabeth     N/A        Discovered    28.76      Crowded       8.94       Dead          448.98
Excited       48.61      Each          253.25     Declared      6.53       Decided       88.65
Fallen        16.92      Estimate      4.76       Door          292.06     Diving        6.29
Families      22.33      Ever          709.22     Enough        501.33     Down          1490.3
Fast          137.45     Every         549.16     Even          875.92     Edge          23.51
Father        554.49     Everything    654.88     Everyone      241.65     Feet          120.73
Filled        27.18      Forest        18.88      Family        354.25     Fell          73
Followed      34.1       Found         396        Felt          119.82     Firecrackers  .71
Forward       72.33      Freeze        32.16      First         840.57     Five          285.45
Foxes         1.16       Guess         453.98     Good          2610.14    Flapping      1.67
Girls         208.35     Heart         244.18     Grandma       N/A        Fluttered     .06
Glanced       .63        Here          4525.25    Hall          51.94      Foot          64.92
Going         2123.29    Home          774.33     Just          4749.14    Getting       484.69
Group         73.76      Huge          48.37      Kept          89.39      Giggling      4.1
Growled       .12        Imitate       1.8        Laughed       10.69      Grinned       .24
Hands         236.53     Kind          590.69     Lots          60.16      Halfway       13.29
Happy         333.2      Knew          368.96     Louder        10.1       Head          371.51
Hated         28.22      Know          5721.18    Made          561.29     Help          921.12
Held          42.45      Languages     4.1        Making        222.53     High          195
Hello         N/A        Leaving       141.39     Metal         19.45      Hurry         173.65
Hike          6.53       Listening     62.84      Mind          484.61     Looked        120.9
Inched        .02        Many          359.43     Modern        18.24      Make          1387.75
Included      7.49       Maple         3.24       More          1298.59    Mice          6.57
Jumped        21.14      Minute        377.49     Never         1362.55    Mayer         N/A
Laughing      52.29      Myself        342.55     Nothing       853.61     Moment        187.04
Little        1446.39    Name          641.86     Ordered       36.96      Much          973.25
Lived         66.04      Nearby        8.33       Papa          N/A        Mustn’t       N/A
Long          675.16     Nimbly        .1         Picked        69.29      Nest          11.1
Moved         69.33      Nobody        266.65     Pitcher       3.24       Next          452.75
Named         69.88      Notice        59.25      Plastic       18.76      Only          1083.71
Need          1294.9     Often         57.35      Politely      1.71       Owls          2.12
Noises        7.16       Once          344.88     Pulling       27.14      Park          72.12
Nuts          53.51      Other         735.39     Quickly       56.49      Place         602.67
Onto          36.69      Professor     69.57      Screams       16.9       Quietly       12.33
Others        99.24      Realized      35.96      Seems         167.55     Ready         387.8
Popped        7.92       Right         4008.39    Shall         185.12     Realize       79.06
Possible      114.04     School        333.12     Shiny         7.8        Returned      24.76
Rabbits       6.43       Single        72.08      Should        1061.94    Scare         33.57
Replied       1.16       Size          46.14      Smiled        4.92       Screeching    2.55
Riverbed      .43        Slid          1.84       Soapsuds      N/A        Seen          384.96
Rotten        17.47      Snickered     .02        Splashing     1.1        Shots         28.37
Seats         21.76      Some          1727.24    Spouts        .06        Sign          133.27
Seemed        54.25      Song          93.69      Started       187.57     Sitting       94.39
Shrill        .47        Sound         143.39     Stepped       12.86      Small         124.96
Slowly        25.08      Stopped       75.37      Stick         97.12      Softly        4.73
Something     1500.16    Strange       86.43      Stood         25.78      Sounded       18.86
Sounds        156.27     Study         49.04      Stop          707.27     Sticks        13.61
Stacey’s      N/A        Suddenly      55.96      Sweater       13.8       Such          291.22
Start         340.1      Things        692.88     Thank         1115.24    Sure          1099.82
Stay          515.65     Those         753.02     Thick         13.98      Take          1891.04
Still         788.73     Thought       808.47     Though        181.94     Teacher       55.73
Story         220.78     Thousands     27.65      Towel         14.16      Teddy         N/A
Sudden        33.47      Town          247.92     Trouble       223.55     These         904
Swamps        .88        Tracks        16.75      Turned        105.65     Think         2691.39
Tested        10.53      Tried         186.84     Twisted       10.59      Tiptoe        .88
Their         655.16     Unknown       15.18      Waiting       211.12     Told          699.59
Today         433.8      Ventured      .47        Want          2759.18    Took          342.24
Together      383.39     Walk          215.86     Warm          52.14      Towards       27.43
Trail         19.2       Went          411.51     Warned        15.84      Tree          65
Troll         2.71       Whistle       15.45      Water         225.06     Trying        448.02
Unsure        1.02       Whose         62.49      Whimpered     .08        Underfoot     N/A
Very          1241.25    Wildly        1.92       Whined        .1         Under         261.92
Voice         86.16      Woods         29.06      Will          2123.65    Warning       31.96
Who’s         N/A        Would         1767.88    Wooden        7.2        Whispered     2.02
Wondered      14.9       Written       44.06      World         455.22     Without       354.65
Wonderful     164        Young         243.18     Years         568.69     Yelled        6.14
Average       231.88                   443.38                   377.65                   291.04

Appendix B

Example  of  a  Training  Story  “Mary’s  Race  Day” 


(Target  words  are  those  bolded  and  underlined  in  the  story) 

The  thick  door  flung  open  and  the  children  ran  outside  onto 
Barton  Street.  They  were  all  looking  to  see  the  shiny  new  wagon 
with  its  modern  metal  frame.  Mary’s  was  the  best  wagon  in 
years.  The  children  admired  its  fancy  plastic  steering  wheel  and 
its  brand  new  wooden  basket.  They  thought  it  was  the  fastest 
wagon  in  the  world.  The  children  laughed  and  yelled  louder  as 
more  wagons  crowded  onto  Barton  Street. 

Mary  could  hear  the  children’s  screams  from  her  bedroom.  She 
didn’t  want  to  keep  them  waiting.  She  quickly  rushed  through  the 
hall into the bathroom for a bath. When Mary finished the bath,
her splashing had left water and soapsuds all over the bathtub.
She  picked  up  a  towel  and  ran  it  over  the  spouts  and  bowls.  She 
stepped  back  and  admired  the  shiny  bathtub  and  spouts. 

“Finished,”  she  declared. 

She  left  the  bathroom  and  went  into  her  bedroom.  She  smiled 
while  pulling  on  her  warm  thick  sweater.  Today  was  Mary’s  first 
wagon  race.  Nothing  could  change  her  wonderful  mood.  Mary 
had  never  felt  more  prepared  for  anything.  She  stepped  back  into 
the  hall.  Her  towel  was  full  of  soapsuds  so  she  carried  it  to  the 
plastic  bowls.  Then  she  went  downstairs.  In  the  kitchen  she  saw 
that  her  father  had  picked  lots  of  lemons.  He  twisted  and  turned 
them  until  he  made  a  baby  pitcher  of  lemonade.  Papa  thought  the 
old  fashioned  lemonade  was  better  than  the  new  modern  stuff 
from  a  can.  Mary  loved  it.  She  often  drank  a  whole  baby  pitcher 
herself.  But  today  she  politely  declared  that  she  was  full  and  that 
she  should  stop.  Mary  was  careful  to  mind  her  manners  and 
thank  her  father.  She  knew  she  had  the  best  family.  Mary 
wouldn’t  trade  them  for  the  world.  Soon  everyone  was  set  to  go 
so  they  left  the  house  and  locked  the  front  door. 

On  the  way  to  the  race,  Grandma  asked  Mary  if  she  should 
continue  to  stretch  her  arms  and  run  around  the  block  to  warm 
up her legs. Mary smiled. Her Grandma never ordered her to do
things.  She  just  politely  suggested  them. 

“That  seems  like  a  good  idea,”  Mary  announced.  “I  think  I 
shall  try  it.” 


But  when  Mary  started  down  the  street  she  wanted  to  change 
her  mind.  There  were  lots  of  people  yelling  for  her  and  they  were 
louder  than  she  could  have  imagined. 

“Oh  no,”  Mary  whined.  As  her  family  began  going  towards  the 
screams  and  crowded  streets  Mary’s  eyes  began  to  water. 

“I’m  Scared,”  she  whined. 

“I  know  this  seems  difficult,  but  I  have  years  of  experience,” 
her Papa warned. “You will continue on alone but we shall
follow  closely  behind.” 

Mary felt like pulling on his arms and making him come with
her  but  she  knew  that  she  was  old  enough  to  go  alone. 

Even  though  she  was  frightened  Mary  kept  her  head  up  as  she 
stood  by  her  wagon.  Her  heart  whimpered  with  fear  as  they  were 
ordered  to  get  into  their  wagons.  But  she  wasn’t  waiting  long. 
Soon  the  gun  fired  and  the  race  had  started.  Mary  made  a  clean 
getaway  and  she  was  quickly  in  first  place.  But  then  Tom,  another 
racer,  scooted  closely  past  her.  Tom  also  had  a  brand  new  metal 
wagon.  He  was  making  this  an  exciting  race.  As  the  road  twisted 
down  the  hill,  Mary  spotted  trouble.  There  was  a  large  wooden 
stick  in  the  middle  of  the  road  but  there  was  nothing  she  could  do. 
Tom  could  not  be  warned.  He  just  kept  racing  towards  trouble. 
His  wagon  came  to  a  sudden  stop  as  it  hit  the  stick.  Tom  fell  hard, 
splashing  into  a  puddle.  Some  of  the  other  kids  laughed  as  they 
passed  him  but  Mary  didn’t  want  to  leave  him  there.  She  turned 
her  wagon  around  and  carried  him  off  the  road.  She  gave  him  her 
sweater  and  helped  him  clean  his  cuts.  Tom  could  not  thank  her 
enough. 

“Will  you  be  alright?”  Mary  asked. 

“Yes,”  he  whimpered,  “thanks  to  you.” 

At  the  end  of  the  day,  even  though  Mary  lost  the  race,  her  good 
deed  was  announced  and  everyone  stood  and  clapped  for  her. 
The Role of the Updating Function in Solving Arithmetic Word Problems


Kanetaka  Mori  and  Masahiko  Okamoto 

Osaka  Prefecture  University 


We  investigated  how  the  updating  function  supports  the  integration  process  in  solving  arithmetic  word 
problems. In Experiment 1, we measured reading time, that is, translation and integration times, when
undergraduate and graduate students (n = 78) were asked to solve 2 types of problems: those containing
only  necessary  information  and  those  containing  extraneous  information.  The  results  indicated  that 
participants  required  more  integration  time  to  solve  extraneous-information  problems  than  necessary- 
information  problems.  However,  the  higher  the  updating,  the  smaller  the  increment  of  integration  time. 
In  Experiment  2,  we  investigated  whether  different  problem  models  were  provided  by  undergraduate  and 
graduate students (n = 73) with different updating functions. Participants executed a lexical-decision task
immediately  following  an  integration  process.  The  lexical-decision  task  comprised  3  conditions: 
necessary-information  word,  extraneous-information  word,  and  novel  word  conditions.  The  RTs  for  both 
necessary-  and  extraneous-information  word  conditions  were  faster  than  that  for  the  novel  word 
condition.  The  facilitation  amount  in  an  extraneous-information  word  became  weaker  as  the  problem 
solver’s  updating  function  increased.  These  results  suggest  that  individuals  with  a  high  updating  function 
provide  a  problem  model  that  maintains  only  task-relevant  information,  while  those  with  less-effective 
updating  use  an  approach  that  also  considers  extraneous  information.  These  2  experiments  indicate  that 
updating  is  an  important  contributor  to  the  integration  process  and  different  updating  abilities  result  in 
different  problem  models. 

Keywords:  arithmetic  word  problems,  integration,  updating,  problem  model 


Understanding  cognitive  processes  underlying  arithmetic  word 
problems  enables  us  to  develop  better  instructional  designs.  Re¬ 
cently,  researchers  have  investigated  the  relationship  between 
working  memory  and  arithmetic  or  mathematics.  Arithmetic  word 
problem  solving  requires  active  manipulation  of  relevant  informa¬ 
tion.  Working  memory  is  used  as  a  mental  space  for  all  cognitive 
activities including word problem solving. According to Baddeley’s
(1986) three-component model, working memory involves
two  subsystems:  a  phonological  loop,  which  stores  verbal  materi¬ 
als,  and  a  visuospatial  sketchpad,  which  houses  visual  information. 
These  two  components  are  coordinated  by  the  central  executive,  a 
supervisory  system  concerned  with  attention.  Several  studies  have 
revealed  the  importance  of  the  central  executive  for  arithmetic 
word  problem  solving  (Andersson,  2007;  Fuchs  et  al.,  2010;  Lee, 
Ng,  Ng,  &  Lim,  2004;  Swanson,  2006;  Swanson  &  Sachse-Lee, 
2001).  Lee  et  al.  (2004)  investigated  the  relationships  among 
working  memory,  literacy,  performance  IQ,  and  arithmetic  word 
problem  performance.  They  revealed  that  the  phonological  loop 
and  visuospatial  sketchpad  contribute  to  arithmetic  word  problem 
performance  via  literacy  and  performance  IQ,  respectively.  They 
also found that the central executive contributes both directly and
indirectly  to  arithmetic  word  problem  performance.  Arithmetic 
word  problem  solving  requires  problem  solvers  to  not  only  main¬ 
tain  incoming  information  using  the  phonological  loop  and  visu¬ 
ospatial  sketchpad  but  also  control  the  information  through  the 
central  executive. 

Although  the  central  executive  contributes  to  arithmetic  word 
problem  solving,  we  do  not  yet  understand  how  this  system  sup¬ 
ports  problem  solving.  If  we  know  more  about  the  relationship 
between  word  problem  solving  and  the  central  executive,  we  can 
gain  implications  for  what  cognitive  process  or  function  we  should 
teach.  In  brief,  we  need  to  know  how  the  central  executive  func¬ 
tions  in  solving  arithmetic  word  problems. 

Some researchers have suggested that the central executive has
a range of functions (Baddeley, 1996; Miyake et al., 2000). Miyake
et al. (2000) provided evidence that the central executive’s
functions include at least three that are separable yet related
(showing both unity and diversity): inhibiting, shifting, and
updating. Inhibiting is the ability to inhibit dominant, automatic,
and prepotent responses. Shifting is the ability to switch back and
forth flexibly between tasks or mental sets. Updating is the ability
to monitor incoming information for relevance to the task at hand
and then appropriately update by replacing old, no longer relevant
information with new, more relevant information. Furthermore, these
functions overlap; for example, updating can include an inhibitory
process. In fact, Miyake and Friedman (2012) reported that updating
consists of a common executive function that substitutes for
inhibition and an updating-specific process. Although there is some
discussion on this point, we follow Miyake et al.’s (2000) definition
to relate our study to previous research.
Agostino, Johnson, and Pascual-Leone (2010) investigated the
relationships  among  the  three  executive  functions,  mental  or 
M-capacity,  and  performance  on  arithmetic  word  problems.  Results 
of  structural  equation  modeling  indicated  that  the  updating  function 
was  the  best  predictor  of  accuracy  in  arithmetic  word  problems. 
Furthermore,  Passolunghi  and  Pazzaglia  (2005)  demonstrated  that 
children  who  achieve  better  mathematical  performance  show  higher 
updating  performance.  These  results  indicate  that  the  updating  func¬ 
tion  relates  to  solving  arithmetic  word  problems.  Although  investiga¬ 
tion  of  the  relationship  between  working  memory  and  arithmetic  word 
problem  performance  revealed  that  updating  is  important  in  solving 
arithmetic word problems, its role in solving arithmetic word problems
remains unclear. To understand the arithmetic word problem-solving
process more clearly, we need to determine the amount of
updating  required  and  the  phase  of  problem  solving  influenced  by 
updating. 

Process  of  Arithmetic  Word  Problem  Solving 

Mayer’s  model  (Mayer  &  Hegarty,  1996)  and  Polya’s  (1957) 
model  suggest  that  the  most  difficult  and  important  phase  for  indi¬ 
vidual  problem  solvers  is  to  understand  a  problem.  This  comprehen¬ 
sion  phase  can  be  divided  into  translation  and  integration  phases 
(Kintsch,  1992;  Mayer  &  Hegarty,  1996).  In  the  translation  phase, 
problem  solvers  need  to  translate  each  statement  in  a  problem  into  a 
mental  representation.  In  the  integration  phase,  problem  solvers  must 
integrate  each  representation  into  a  problem  model.  Problem  solvers 
use  numerical  expressions  to  solve  the  problem,  based  on  the  problem 
model  employed.  The  process  that  results  in  the  formation  of  a 
particular  model  is  important  because  the  solution  strategy  depends  on 
it. 

Muth  (1984,  1992)  demonstrated  that  word  problem-solving  per¬ 
formance  is  influenced  by  the  difficulty  of  integration.  She  assigned  a 
problem-solving  task  in  which  the  problem  included  extraneous  in¬ 
formation  (not  required  to  solve  the  problem).  Her  experiment 
showed  that  a  child  who  could  solve  a  problem  with  no  extraneous 
information  was  likely  to  fail  in  solving  an  extraneous-information 
problem.  Although  extraneous  information  did  not  change  the  syn¬ 
tactic  complexity  or  the  expression  of  a  problem,  it  increased  demands 
related  to  integration  of  information.  Muth  (1984,  1992)  also  revealed 
that  the  integration  process  is  important  and  extraneous  information 
increases  the  difficulty  in  the  process,  but  did  not  indicate  why  this 
effect  occurs  and  what  cognitive  function  relates  to  a  successful 
integration  process.  Thus,  we  need  to  investigate  the  cognitive  func¬ 
tion  underlying  the  integration  process. 

Hegarty,  Mayer,  and  Green  (1992)  and  Okamoto  (1999)  exam¬ 
ined  the  nature  of  word  problems  that  are  challenging  for  children 
and  adults,  by  looking  at  the  reading  times  required  for  integration. 
They  found  that  both  children  and  adults  require  more  time  for 
integration  when  solving  extraneous-information  problems.  These 
results  suggest  that  extraneous-information  problems  and  measure¬ 
ment  of  reading  times  are  useful  for  examining  the  integration 
phase  in  solving  word  problems. 

Relationship  Between  Integration  and  the 
Updating  Function 

An important assumption generated from research into the relationship
between working memory and word problem solving is that the updating
function may play a critical role in the integration
process.  In  the  integration  process,  one  needs  to  continually  inte¬ 
grate  incoming  information  to  form  a  problem  model.  When  new 
information  enters  working  memory,  it  changes  the  model  and  the 
old  problem  model  is  replaced  by  the  newer  one.  This  means  that 
problem  solvers  might  have  to  update  the  nature  of  their  problem 
model sequentially when they are carrying out a word problem-solving
task. Updating failure results in an inappropriate problem
model  and  an  incorrect  answer.  Therefore,  we  hypothesize  that 
updating  in  integration  is  crucial  for  word  problem  solving. 

Evidence  to  support  this  prediction  comes  from  Kotsopoulos 
and  Lee  (2012),  who  examined  the  phase  and  nature  of  errors 
occurring  in  word  problem  solving.  They  video-recorded  students 
talking  aloud  while  completing  homework.  Their  speech  coding 
identified  which  of  the  four  problem-solving  phases  appeared 
problematic  for  students  and  which  singular  executive  function 
played  the  most  important  role  in  explaining  their  challenges. 
Updating  challenges  occurred  when  students  were  having  difficul¬ 
ties  evaluating  information  and  appropriately  editing  it  according 
to  more  relevant  information  in  ways  that  would  allow  them  to 
proceed  to  the  next  problem-solving  phase.  This  research  revealed 
that  most  errors  occur  in  the  integration  process,  and  were  caused 
by  the  updating  function  in  integration.  Integration  might  be  a 
main  process  in  solving  word  problems,  and  updating  is  a  key 
cognitive function in this process. However, these results were
obtained from subjects’ verbal reports, which may not clearly reflect
the underlying cognitive processes. There is no direct
evidence  that  their  errors  were  due  to  the  updating  function, 
because  Kotsopoulos  and  Lee  did  not  objectively  measure  stu¬ 
dents’  updating  function.  Thus,  the  relationship  between  the  up¬ 
dating  function  and  the  integration  process  needs  to  be  further 
investigated  empirically. 

Present  Study 

The  purpose  of  this  study  was  to  reveal  the  role  of  the  updating 
function  in  integration  when  solving  an  arithmetic  word  problem. 
Investigating  this  role  would  provide  insights  into  how  word 
problems  are  solved  and  what  process  we  must  carefully  teach 
children  who  have  lower  updating. 

If the updating function aids an efficient integration process, a
problem solver who has a higher updating function might be only
slightly influenced by the difficulty of integration, even though
extraneous information can increase that difficulty (Muth,
1984, 1992). To explore the relationship between the updating
function  and  the  integration  process,  we  measured  reading 
times  when  undergraduate  students  were  solving  necessary-  and 
extraneous-information  problems. 

The  contribution  of  working  memory  to  arithmetic  word  prob¬ 
lems  is  not  stable  in  elementary  school  students  (Rasmussen  & 
Bisanz, 2005; Meyer, Salimpoor, Wu, Geary, & Menon, 2010).
Moreover,  if  problem  solvers  do  not  acquire  language  skills,  they 
cannot  solve  arithmetic  word  problems,  and  so  we  cannot  examine 
the  contribution  of  working  memory  to  arithmetic  word  problem 
solving.  Because  working  memory  and  language  skills  are  con¬ 
founded  in  children,  we  used  undergraduate  students  with  stable 
working memory, sufficient language skills, and related problem-solving
schema.
Our predictions were as follows: (a) An extraneous-information
problem requires more integration time than a necessary-information
problem, and (b) the effect of extraneous information on
integration time is smaller in higher updating problem solvers.

Experiment  1 

Participants 

Participants  were  78  undergraduate  students  (42  males)  in  Japan 
with an average age of 19.54 years. Most participants were recruited
from introductory psychology lectures in 2012 and 2015. They
participated voluntarily and were compensated with course credits or
a 500-yen book coupon. Some participants were recruited personally
from among acquaintances who were not closely acquainted with the
researcher. All participants were native Japanese
speakers  and  were  enrolled  in  the  following  courses:  Social  Sci¬ 
ence,  Human  Science,  Engineering,  Science,  and  Health  Science. 
Math-related  courses  were  Engineering  and  Science.  Participants 
in  other  courses  also  had  sufficient  mathematical  achievement,  as 
demonstrated  by  their  passing  of  the  National  Center  Test  for 
University  Admissions  in  Japan.  Sample  problems  of  this  test  were 
reported  in  Wu  (1993).  Mathematics  in  this  test  includes  vectors 
and  functions. 

Apparatus 

Stimulus  presentation  and  response  recording  were  controlled 
by  Matlab  with  the  Psychophysics  Toolbox  extensions  (Brainard, 
1997;  Pelli,  1997;  Kleiner  et  al.,  2007)  on  an  Apple  iMac  21.5-inch 
display  and  an  original  USB  response  key  box  with  three  buttons. 

Tasks  and  Procedures 

Arithmetic  word  problem-solving  task.  Participants  were 
asked  to  read  a  word  problem  and  select  a  formula  to  solve  it. 
Sentences  were  displayed  one  by  one.  Participants  could  manipu¬ 
late  the  displayed  sentence,  moving  backward  or  forward  by  press¬ 
ing  the  left  or  right  key  on  the  response  box.  They  were  required 
to  press  the  center  key  once  they  had  comprehended  the  problem. 
After  pressing  this  key,  three  expressions  appeared,  and  partici¬ 
pants  selected  the  correct  expression. 

To analyze the comprehension process in arithmetic word
problem solving, we calculated three types of reading times (see
Figure 1). The time participants took to read a word problem
sentence until expressions were displayed was the “whole reading”
time. Time taken from onset of a problem to the end of the
question statement’s presentation was the “translation” time.
Time taken from the question statement’s initial presentation to
the expressions’ presentation was the “integration” time. This
reflected the integration process in solving a word problem.
Hegarty et al. (1992) and Okamoto (1999) measured the time
from a question statement’s presentation to selecting a numerical
expression to solve a problem. This reflects not only the
integration process but also the planning process. However, we
must distinguish integration time from planning time. Thus, in
this study, integration time was defined as the time from the
onset of question presentation to the end of reading a problem.
Planning time was defined as the reaction time (RT) needed to
select the correct expression.

Figure 1. Example of three reading times. Sent. and Ques. each indicate
a sentence in a problem. Ques. indicates an interrogative statement.
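
The timing definitions above can be illustrated with a minimal sketch. It assumes that each trial logs four timestamps; the class and field names are hypothetical and do not come from the authors' Matlab code. The end point of the translation interval is also an assumption: the sketch takes it to be the moment the question statement first appears, which is one reading of the definition given in the text.

```python
from dataclasses import dataclass


@dataclass
class TrialEvents:
    """Timestamps (in seconds) for one trial; all field names are hypothetical."""
    problem_onset: float         # first sentence appears
    question_first_onset: float  # question statement first appears
    expressions_onset: float     # center key pressed, expressions shown
    response: float              # participant selects an expression


def reading_times(ev: TrialEvents) -> dict:
    """Split one trial into the intervals described in the text."""
    return {
        "whole_reading": ev.expressions_onset - ev.problem_onset,
        # Assumption: translation ends when the question first appears.
        "translation": ev.question_first_onset - ev.problem_onset,
        "integration": ev.expressions_onset - ev.question_first_onset,
        "planning": ev.response - ev.expressions_onset,
    }


trial = TrialEvents(problem_onset=0.0, question_first_onset=4.0,
                    expressions_onset=7.5, response=9.0)
times = reading_times(trial)
```

Under this reading, translation and integration times sum to the whole reading time, and planning time is kept separate from integration time, as the text requires.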

We  prepared  two  types  of  arithmetic  word  problems: 
necessary-information  and  extraneous-information  problems. 
Table  1  shows  an  example  of  each.  A  necessary-information 
problem  consisted  of  three  sentences — all  needed  to  form  an 
expression.  The  extraneous-information  problem  contained  four 
sentences — one  of  which  was  unnecessary  for  forming  an  ex¬ 
pression.  We  used  16  arithmetic  word  problems.  Half  of  the 
problems  were  necessary-information  problems  and  the  others 
were  extraneous-information  problems.  Four  problems  were 
used  for  practice  trials. 

Updating tasks. We used two updating tasks: (a) a letter
memory task (adapted from Miyake et al., 2000) for the phonological
domain, and (b) a visual n-back task (adapted from
Agostino et al., 2010) for the visual domain.

The  letter  memory  task  measured  phonological  updating  in 
working  memory.  In  this  task,  letters  were  presented  serially  on  a 
computer  screen  at  a  rate  of  2,000  ms  per  letter.  Participants  were 
required  to  rehearse  aloud  the  last  four  letters  throughout  by 
dropping  the  fifth  letter  and  adding  the  most  recent  one.  We 
recorded  participants’  verbal  recall.  The  numbers  of  letters  in  the 
stimulus  sets  were  5,  7,  9,  and  11.  Each  set  was  presented  four 
times,  and  each  participant  completed  16  trials.  If  participants 
could  accurately  name  aloud  all  letters  in  each  set,  the  trial  was 
classified  as  correct.  The  phonological  updating  score  was  the 
percentage  of  correct  trials. 
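
As a rough illustration of how the phonological updating score could be computed, the sketch below assumes that a trial counts as correct when the final recall equals the last four presented letters in order. This is an interpretation of the scoring rule, and the letter strings are invented for the example.

```python
def trial_correct(presented: str, recalled: str, span: int = 4) -> bool:
    """True if the final recall equals the last `span` presented letters, in order."""
    return list(recalled) == list(presented[-span:])


def updating_score(trials) -> float:
    """Percentage of correct trials over (presented, recalled) pairs."""
    n_correct = sum(trial_correct(p, r) for p, r in trials)
    return 100.0 * n_correct / len(trials)


# Hypothetical data: one trial at each list length (5, 7, 9, and 11 letters).
trials = [
    ("KTBRM", "TBRM"),        # correct
    ("HGLPXDN", "PXDN"),      # correct
    ("QZFJWSCVY", "SCVY"),    # correct
    ("MRBKTLGHPXD", "GHPX"),  # incorrect: the last four letters are H P X D
]
score = updating_score(trials)
```

In the experiment each list length was presented four times, so the score would be taken over 16 trials rather than the four shown here.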

The  visual  n-back  task  measured  visual  updating  of  working 
memory.  Three  dots  were  presented  on  a  computer  screen.  Partic¬ 
ipants  were  required  to  decide  whether  the  current  pattern  was  the 
same  as  the  pattern  n  trials  before.  Their  decisions  were  recorded 
by  pressing  a  key  that  changed  the  presentation  pattern.  In  each 
n-back  task,  there  were  54  test  trials  following  12  practice  trials. 
We  used  20  matched  test  trials  and  34  mismatched  trials.  We  used 
the  1-back  and  2-back  tasks.  Each  score  was  calculated  by  the 
following  formula:  (proportion  of  correct  match  +  proportion  of 
correct  mismatch). 
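
The n-back score described above can be sketched as follows. The dot patterns are represented here by arbitrary labels, and the function names and response coding (True for a "same" decision) are assumptions made for illustration.

```python
def nback_targets(patterns, n):
    """For each scored trial, report whether it matches the pattern n trials back."""
    return [patterns[i] == patterns[i - n] for i in range(n, len(patterns))]


def nback_score(targets, responses):
    """Proportion of correct matches plus proportion of correct mismatches."""
    # "Same" responses on trials that really were matches.
    match_hits = [r for t, r in zip(targets, responses) if t]
    # "Different" responses on trials that really were mismatches.
    mismatch_hits = [not r for t, r in zip(targets, responses) if not t]
    return (sum(match_hits) / len(match_hits)
            + sum(mismatch_hits) / len(mismatch_hits))


patterns = ["A", "B", "A", "B", "C", "B", "C", "A"]  # hypothetical dot patterns
targets = nback_targets(patterns, n=2)
responses = [True, True, False, False, True, False]  # True = "same"
score = nback_score(targets, responses)
```

Because the two proportions are summed, the score ranges from 0 to 2, and a participant who always gives the same response scores 1 regardless of the ratio of matched to mismatched trials.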

General  Procedures 

Participants were asked to perform the arithmetic word problem
task and then the two updating tasks. The order of the phonological
and visual updating tasks was counterbalanced across participants.
Before each task, participants received instructions displayed on
the monitor and could ask the experimenter questions. The total
experimental time was approximately 40 min.

Table 1

Examples of Necessary-Information Problems and
Extraneous-Information Problems

Necessary-information problem      Extraneous-information problem

There are 5 pencils.               There are 23 apples.
There are 1.6 times more pens      There are 3 times more oranges
than pencils.                      than apples.
How many pens are there?           _There are 8 times more cranes
                                   than apples._
                                   How many oranges are there?

Note. The underlined sentence is extraneous information.


Results  and  Discussion 

One participant was excluded from the analysis because of low
accuracy (68%) in selecting a correct expression. Thus, data from
77 participants were analyzed. Whole reading, translation,
integration, and planning times were calculated after excluding the
error trials. Trials were excluded if an individual’s whole reading
time fell outside the mean ± 2 standard deviations (SD). For
analysis of planning time, trials outside the mean ± 2 SD for
planning time were excluded. All reading times (whole reading,
translation, and integration times) were standardized within each
participant because of individual differences in these times.
Table 2 shows basic statistics for all measures. The phonological
updating score was measured by the letter memory task, and the
visual updating score was obtained using the visual 2-back task.
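
The trimming and standardization steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: outliers are removed on both sides of the mean, the sample SD is used, trimming precedes standardization, and both are applied within a single participant. The function names and the reading times are hypothetical.

```python
from statistics import mean, stdev


def trim_trials(times, k=2.0):
    """Keep trials whose time lies within mean +/- k sample SDs for this participant."""
    m, s = mean(times), stdev(times)
    return [t for t in times if abs(t - m) <= k * s]


def standardize(times):
    """Convert one participant's times to z scores (mean 0, SD 1)."""
    m, s = mean(times), stdev(times)
    return [(t - m) / s for t in times]


# Hypothetical whole reading times (ms) for one participant; the last is an outlier.
raw = [9800, 10100, 10400, 9900, 10200, 10000, 10300, 9700, 10050, 30000]
kept = trim_trials(raw)        # the 30,000 ms trial is dropped
z_scores = standardize(kept)   # comparable across fast and slow readers
```

Standardizing within participants removes overall differences in reading speed, so that the regression analyses reflect differences between problem types rather than between fast and slow readers.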

Accuracy in selecting a correct expression was analyzed using a
paired-sample t test for word problem type (necessary-information
problem vs. extraneous-information problem). This analysis showed a
significant difference in accuracy between word problem types,
t(76) = 2.98, p < .01, which indicated that participants made more
errors with extraneous-information problems than with
necessary-information problems. Similarly, whole reading time was
analyzed using a paired-sample t test for word problem type. This
analysis also showed a significant difference, t(76) = -22.69,
p < .001, indicating that participants had more difficulty and
needed more time to comprehend an extraneous-information problem
than a necessary-information problem.
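
For readers unfamiliar with the statistic, a paired-sample t test can be computed as in the sketch below. The function name and the example values are hypothetical and are not data from the study; the point is only that each participant contributes one difference score, so 77 participants yield 76 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Paired-sample t statistic and degrees of freedom for matched lists x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical accuracies for five participants on the two problem types.
necessary = [1.00, 0.92, 1.00, 0.92, 1.00]
extraneous = [0.92, 0.83, 1.00, 0.83, 0.92]
t_value, df = paired_t(necessary, extraneous)
```

The sign of t depends only on the order of subtraction, which is why the accuracy and reading-time tests can have opposite signs while pointing to the same conclusion.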

Individual differences of updating in solving word problems.
To estimate the contribution of the updating function in solving
word problems, we used multiple regression analyses, with
translation, integration, and planning times as dependent variables
and problem type, phonological updating score, visual updating
score, and their interactions as independent variables. Before the
analysis, the phonological and visual updating scores were centered
on their means. Problem type was coded as a dummy variable, with 0
indicating a necessary-information problem and 1 an
extraneous-information problem. Table 3 shows the results of the
regression analysis for each dependent variable.
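
The structure of this regression model can be made concrete with a short Python sketch that builds one row of predictors per observation. The function and variable names and the updating scores are hypothetical. The sketch also assumes that each participant contributes one value per problem type, an assumption inferred from the reported degrees of freedom rather than stated in the text: 77 participants × 2 problem types gives 154 observations, and 154 − 7 predictors − 1 intercept = 146.

```python
from itertools import combinations


def center(values):
    """Subtract the mean so that 0 corresponds to an average updating score."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def design_row(problem_type, phon_c, vis_c):
    """Intercept, three main effects, three two-way products, one three-way product."""
    main = {"type": problem_type, "phon": phon_c, "vis": vis_c}
    row = {"intercept": 1.0, **main}
    for a, b in combinations(main, 2):
        row[f"{a}:{b}"] = main[a] * main[b]
    row["type:phon:vis"] = problem_type * phon_c * vis_c
    return row


def residual_df(n_observations, n_predictors):
    """Residual degrees of freedom for a model that includes an intercept."""
    return n_observations - n_predictors - 1


# Hypothetical updating scores for three participants (not data from the study).
phon = center([0.50, 0.62, 0.74])
vis = center([0.80, 0.84, 0.88])
rows = [design_row(t, p, v) for p, v in zip(phon, vis) for t in (0, 1)]

n_predictors = len(rows[0]) - 1               # 7 predictors besides the intercept
df_exp1 = residual_df(77 * 2, n_predictors)   # 146 under the stated assumption
```

Because problem type is coded 0 for necessary-information problems, the coefficient for an updating score describes its effect in those problems, and the interaction term describes how that effect changes in extraneous-information problems.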

Analysis of translation time showed no significant contribution of
phonological updating, which suggested that the phonological
updating function was not important in translation. Word problem
type and the interaction of visual updating and word problem type
were significant, t(146) = 21.92, p < .001, and t(146) = -2.20,
p < .05, respectively. These results indicated that although
translation time was longer for extraneous-information problems
than for necessary-information problems, higher visual updating
reduced this tendency. The requirement for more translation time
with extraneous-information problems was mostly due to the
extraneous-information sentence. Higher visual updating solvers
might be faster at encoding sentences than lower visual updating
solvers. This effect was observed especially in
extraneous-information problems because these problems included an
additional sentence to be encoded.

For integration time, regression analysis revealed a significant
contribution of word problem type, t(146) = 12.63, p < .001. It
also revealed a significant contribution of the interaction of
phonological updating and word problem type, t(146) = -2.01,
p < .05, but not of phonological updating alone. These results
indicated that in necessary-information problems, the phonological
updating function did not contribute to integration, whereas in
extraneous-information problems, higher phonological updating
reduced integration time. These results support our hypothesis that
the updating function aids an efficient integration process.

Analysis of planning time did not show any significant
contribution; even problem type was not significant. This indicated
that planning time did not differ between necessary-information and
extraneous-information problems, suggesting that problem solvers
completed the integration process before selecting an expression in
this experiment.

In summary, during integration, lower phonological updating
solvers were more strongly affected by extraneous information than
higher phonological updating solvers. This difference in
integration time might be caused by differences in their updating
functions. Specifically, the findings regarding integration time
indicate that the integration process depends on the updating
function. If the integration process differs according to the
updating function, the problem model that arises from it may differ
as well. However, the nature of the problem model resulting from
the integration process was not clear in this experiment and
requires further investigation.

According to Kintsch (1992), the integration process includes
elaboration of the problem model. In this elaboration, one forms an
appropriate problem model that activates only the information
required to solve the problem; extraneous information is not
activated in such a clear problem model. One possibility that can
account for the difference between levels of phonological updating
ability is that the updating function helps elaborate a problem so
as to construct a clear problem model.

An alternative explanation involves working memory capacity (WMC).
Problem solvers with high WMC could solve arithmetic word problems
more easily than problem solvers with low WMC (e.g., Rasmussen &
Bisanz, 2005; Geary, 2004). If problem solvers with high WMC form
more appropriate problem models than those with low WMC, the model
formed by high-WMC solvers can include all information linked to a
problem. Such a problem model might keep even extraneous
information highly activated. Examining the activation of the
information included in a problem model would indicate which
explanation is appropriate. Therefore, we need to explore the level
of activation of both necessary and extraneous information after
integration.

Table 2

Basic Statistics for Whole Reading, Translation, Integration,
and Planning Times; Accuracy in Selecting an Expression; and
Updating Functions in Experiment 1

Variable                          M         SD        Min        Max
Whole reading time        10,732.10   3,959.68   3,621.33  23,049.50
Translation time           8,949.21   3,771.67   2,588.80  23,049.50
Integration time           3,944.38   1,629.80   1,275.00   9,757.83
Planning time              1,781.65     573.97     804.00   4,382.50
Accuracy                        .94        .07        .75       1.00
Phonological updating           .62        .25        .06       1.00
Visual updating                 .84        .10        .46       1.00

Table 3

Standardized Partial Regression Coefficients of Regression Analysis for Translation, Integration,
and Planning Times in Experiment 1

Independent variable                          Translation time  Integration time  Planning time
Word problem type                                  1.75***           1.43***           .03
Phonological updating                              -.06               .09             -.07
Visual updating                                     .10              -.08              .07
Phonological updating × Visual updating             .04               .02              .05
Phonological updating × Word problem type           .08              -.23*            -.12
Visual updating × Word problem type                -.18*              .16              .05
Three-way interaction                              -.07              -.04             -.04
Adjusted R-squared                                  .76               .51             -.02
F(7, 146)                                         69.75***          23.80***           .58

Note. Asterisks indicate significance of the coefficients (* = .05. *** = .01).

Experiment  2 

The purpose of Experiment 2 was to reveal the nature of the
problem model as a function of updating. We used a lexical-decision
task for this purpose. Participants were required to decide
whether a presented stimulus was a word or not. This type of
decision is made more rapidly if the same word has been previously
presented, because the representation of the stimulus word is
already activated. A lexical-decision task is therefore useful for
investigating the activation of representations in working memory.

We made the following predictions about response times (RTs) in
the lexical-decision task. If problem solvers with high updating
formed a clear problem model, their RTs for necessary-information
words would be faster than those for extraneous-information words.
In lower updating solvers, representations of both extraneous- and
necessary-information words would still be activated, because
their problem model could not be updated and they were likely to
form a problem model including extraneous information. Therefore,
lexical decisions for these two types of words would be more rapid
than decisions for a novel word. In short, RTs for extraneous- and
necessary-information words might not differ for lower updating
solvers, whereas RTs for necessary-information words would be
faster than those for extraneous-information words in higher
updating solvers. Investigating RTs for these three types of words
in a lexical-decision task would enable us to explore the nature of
the problem model.

Participants 

In Experiment 2, 73 undergraduate and graduate students in
Japan (29 males) participated. Their average age was 19.59 years.
Most participants were recruited from introductory psychology
lectures in 2013 and 2015. They participated voluntarily and
received course credits or a book coupon worth 500 yen. Some
participants were recruited personally from among the researcher’s
acquaintances, none of whom were close to the researcher. All
participants were native Japanese speakers and were enrolled in the
following courses: Social Science, Human Science, Engineering,
Science, and Health Science. The math-related courses were
Engineering and Science. None had participated in Experiment 1.

Tasks  and  Procedures 

Arithmetic word problem-solving task. To explore activation
immediately after the integration process, we inserted the
lexical-decision task between reading a problem and selecting a
correct expression.

First, participants were instructed to read a problem until they
comprehended it. In this first phase, each sentence of a problem
was displayed individually, and participants were free to choose
which sentence they wanted to read. After participants pressed a
key indicating that they had understood the problem, the
lexical-decision phase began. In this phase, two stimuli (a word
and a nonword) were presented side by side on a computer screen.
Participants were required to select the word by pressing a key
(the left key for the stimulus on the left and the center key for
the stimulus on the right). After the lexical-decision phase, two
expressions were presented side by side on the computer screen. One
of the two expressions was the correct solution to the arithmetic
word problem. Participants were required to select the correct
expression by pressing a key.

A set of 24 arithmetic word problems was developed. Half of the
problems included extraneous information, selected in a
pseudorandom manner. This selection was counterbalanced across
participants. Extraneous-information problems had an assignment,
two related sentences, and a question. The two sentences comprised
an extraneous sentence, which was unnecessary for solving the
problem, and a relevant sentence, which was necessary for solving
it. For this study’s purposes, the variable noun in each of these
sentences was called the extraneous-information word or the
necessary-information word, respectively.

Although a typical lexical-decision task requires a word/nonword
decision about a single stimulus, the lexical-decision task in this
study required participants to select the word from two letter
strings. The selection of a correct expression from two expressions
was presented immediately after the lexical-decision task. We used
this form of lexical-decision task to reduce the cognitive load
from task switching. In the lexical-decision task, the presented
words belonged to one of three conditions: (a) the
necessary-information word condition, (b) the
extraneous-information word condition, and (c) the novel word
condition. Based on word frequency norms for Japanese (NTT database
series; Amano & Kondo, 1999), 72 words written in Hiragana were
collected. Half of the words had two characters and the other half
had three. We prepared 24 sets of three words that corresponded to
the three conditions. The three words in a set had the same number
of characters and similar frequency. By linking two-letter
syllables not normally associated with each other (Umemoto,
Morikawa, & Ibuki, 1955), 24 nonwords were created. Nonwords
comprising three letters were formed by adding one letter to
two-letter syllables. Furthermore, six words and four nonwords were
prepared for four practice problems.

Updating tasks. A letter memory task and a visual n-back
task were used to index the updating function. The visual n-back
task (adapted from Friedman et al., 2008) differed from that in
Experiment 1 in two main features. The first feature was the
presentation duration. In Experiment 1, the stimulus was displayed
until participants pressed a key. To ensure more updating in
Experiment 2, the stimulus was displayed for 1,000 ms, followed
by a blank screen for 1,500 ms. During the blank screen,
participants pressed a key indicating whether or not the stimulus
was the same as the one displayed n trials earlier. The second
feature was the presentation stimulus. One black square and nine
white squares were presented on a computer screen. As in
Experiment 1, the phonological updating score was the percentage
of correct responses and the visual updating score was the 2-back
task score.

Results  and  Discussion 

Analysis of word problem-solving process in problems containing
extraneous information. Whole reading, translation, integration,
and planning times were analyzed after excluding trials with errors
in selecting an expression. Trials were also excluded from analysis
if whole reading time fell outside the mean ± 2 SD for each
subject. For analysis of planning time, trials outside the mean
± 2 SD for planning time were excluded. One participant was
excluded because she made many errors in the lexical-decision task
and struggled to select the correct expression (correct responses
were 92% in the lexical-decision task and 92% in selecting an
expression). Table 4 shows the basic statistics for all measures.

Accuracy in selecting correct expressions was analyzed using a
paired-sample t test for word problem type (necessary vs.
extraneous). This analysis showed a significant difference between
word problem types, t(71) = 2.65, p < .01. Whole reading time was
also analyzed using a paired-sample t test for word problem type,
and the difference between word problem types was significant,
t(71) = -25.03, p < .001. These results indicated that
extraneous-information problems are more difficult than
necessary-information problems for participants to comprehend and
solve.

As in Experiment 1, translation, integration, and planning times
were analyzed using multiple regression. Table 5 shows the results
of the regression analyses. Word problem type, phonological
updating, visual updating, and their interactions were used as
independent variables. The results showed a significant
contribution of word problem type to translation time,
t(136) = 20.57, p < .001. For integration time, the contribution
of word problem type, t(136) = 17.82, p < .001, and the interaction
of phonological updating and word problem type, t(136) = -2.17,
p < .05, were significant. These results indicated that lower
phonological updating solvers were more influenced by extraneous
information than higher phonological updating solvers.

Analysis of planning time showed that the contribution of
phonological updating was significant, t(136) = -3.58, p < .001.
This result indicated that the higher the phonological updating,
the faster the selection of a correct expression. The results for
the word problem-solving process in Experiment 2 were consistent
with those in Experiment 1.

Nature of problem model and updating. We required participants
to perform the lexical-decision task immediately after the
integration process to reveal differences between the problem
models of high and low updating solvers. Lexical-decision data from
extraneous-information problems were analyzed.

Table 4 shows that the accuracy for lexical decisions was very
high. Trials with RTs outside the mean ± 2 SD and trials with
errors in the lexical-decision task or in selecting an expression
were excluded. RTs for lexical decisions were also analyzed using
multiple regression analysis. To analyze target word type, we
created two dummy variables: a necessary condition and an
extraneous condition. Target word type was coded as follows:
Necessary-information words were coded as 1 in the necessary
condition and 0 in the extraneous condition. Extraneous-information
words were coded as 0 in the necessary condition and 1 in the
extraneous condition. Novel words were coded as 0 in both
conditions. The two dummy variables, phonological updating, visual
updating, and their interactions were used as independent
variables. Table 6 shows the results of this regression analysis.

Table 4

Basic Statistics for Reading Time in Each Phase, Accuracy for Selecting an Expression,
Reaction Time, Accuracy for Lexical Decision, and Updating Functions in Experiment 2

Variable                              M         SD        Min        Max
Whole reading time            10,054.06   3,311.47   3,511.25  21,838.50
Translation time               8,633.82   3,114.11   1,994.92  15,727.18
Integration time               3,220.31   1,581.40     941.75  11,158.83
Planning time                  1,117.98     345.73     491.00   2,620.75
Accuracy                            .98        .03        .83       1.00
RTs for lexical decision         863.71     214.14     513.75   1,929.25
Accuracy for lexical decision      1.00        .01        .92       1.00
Phonological updating               .65        .22        .19       1.00
Visual updating                     .92        .11        .49       1.00

UPDATING FUNCTION IN SOLVING WORD PROBLEMS

Table 5

Standardized Partial Regression Coefficients of Regression Analysis for Translation, Integration,
and Planning Times in Experiment 2

Independent variable                          Translation time  Integration time  Planning time
Word problem type                                  1.73***           1.66***           .26
Phonological updating                              -.02               .10             -.40***
Visual updating                                    -.03               .01              .10
Phonological updating × Visual updating             .02               .04             -.02
Phonological updating × Word problem type           .04              -.20*            -.01
Visual updating × Word problem type                 .09              -.02             -.04
Three-way interaction                              -.04              -.08             -.04
Adjusted R-squared                                  .75               .69              .14
F(7, 136)                                         60.92***          46.07***          4.27***

Note. Asterisks indicate significance of the coefficients (* = .05. *** = .01).
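
The dummy coding of target word type described above can be written out as a short Python sketch. The predictor names are hypothetical labels, and the count of observations assumes one mean RT per analyzed participant and word type (72 × 3 = 216), which is inferred from the reported degrees of freedom rather than stated in the text.

```python
DUMMY_CODES = {
    # word type: (necessary condition, extraneous condition)
    "necessary": (1, 0),
    "extraneous": (0, 1),
    "novel": (0, 0),  # reference category: both dummy variables are 0
}


def predictor_names():
    """List the predictors implied by two dummies, two updating scores, and their interactions."""
    conditions = ("necessary", "extraneous")
    updating = ("phon", "vis")
    mains = list(conditions) + list(updating)
    two_way = ["phon:vis"] + [f"{u}:{c}" for u in updating for c in conditions]
    three_way = [f"phon:vis:{c}" for c in conditions]
    return mains + two_way + three_way


names = predictor_names()                      # 4 main, 5 two-way, 2 three-way terms
n_observations = 72 * 3                        # assumption: one mean RT per word type
residual_df = n_observations - len(names) - 1  # subtract predictors and the intercept
```

With novel words as the reference category, each dummy coefficient compares one primed word type with novel words, and each interaction with a dummy variable shows how the effect of an updating score for that word type departs from its effect for novel words.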

The results showed significant contributions of phonological
updating, t(204) = -4.95, p < .001; visual updating,
t(204) = 2.53, p < .05; the necessary condition, t(204) = -10.34,
p < .001; and the extraneous condition, t(204) = -7.03, p < .001.
These results indicated that the higher the phonological updating,
the shorter the RTs for lexical decisions, whereas the higher the
visual updating, the longer the RTs. The results for target word
type showed that RTs for necessary-information words were faster
than those for extraneous-information and novel words, and that RTs
for extraneous-information words were faster than those for novel
words. Furthermore, the interaction of phonological updating and
the extraneous condition was significant, t(204) = -2.08, p < .05.
This indicated that the influence of phonological updating on RTs
was weaker for extraneous-information words than for novel and
necessary-information words.

These results showed that both necessary- and
extraneous-information words were strongly activated for lower
phonological updating solvers. Conversely, the priming effect of
necessary information was stronger than that of extraneous
information for higher phonological updating solvers; that is,
necessary information was activated more strongly than extraneous
information in their problem model. High phonological updating
solvers formed a clear problem model by updating it during
integration. Those with lower phonological updating, even after
integration, had equal activation of necessary and extraneous
information.

In  summary,  results  of  the  lexical-decision  task  revealed  that  the 
nature  of  the  problem  model  depends  on  individual  differences  related 
to  phonological  updating.  Problem  solvers  with  high  phonological 
updating  constructed  a  problem  model  that  included  task-relevant 
information  only.  Thus,  the  integration  process  in  word  problem 
solving  depends  on  one’s  updating  function.  Problem  solvers  with 
low  phonological  updating  might  construct  their  problem  model  with 
all  relevant  and  extraneous  information.  A  problem  solver’s  updating 
function  is  one  of  the  most  important  cognitive  factors  in  constructing 
a  problem  model  during  the  integration  process. 

General  Discussion 

Constructing  a  Problem  Model  and  the  Role  of  the 
Updating  Function 

The purpose of this study was to reveal the relationship between
the integration process and the updating function in solving
arithmetic word problems. In two experiments, we found that
integration time for an extraneous-information problem was longer
than when necessary information alone was given. Additionally, the
effect of extraneous information on integration was stronger in low
phonological updating solvers than in high phonological updating
solvers. These results suggested that low updating solvers
struggled to form a problem model, especially for problems
involving extraneous information. The results support our
hypothesis that updating is important in the integration process in
solving arithmetic word problems.

Table 6

Standardized Partial Regression Coefficients of Regression Analysis for RTs in Lexical Decision Tasks in Experiment 2

Predictor                                                        Coefficient
Main effect
  Necessary condition                                              -1.34***
  Extraneous condition                                              -.91***
  Phonological updating                                             -.45***
  Visual updating                                                    .25*
Two-way interaction
  Phonological updating × Visual updating                            .01
  Phonological updating × Necessary condition                        .22
  Phonological updating × Extraneous condition                       .27*
  Visual updating × Necessary condition                             -.12
  Visual updating × Extraneous condition                            -.09
Three-way interaction
  Phonological updating × Visual updating × Necessary condition      .01
  Phonological updating × Visual updating × Extraneous condition     .06

Adjusted R-squared = .40
F(11, 204) = 14.19***

Note. Asterisks indicate significance of the coefficients (* = .05. *** = .01).

The results of Experiment 2 support the assumption that
differences in the updating function cause differences in the
problem model. In the lexical-decision task, the higher the
phonological updating, the faster the decisions. However, this
effect was weaker in the extraneous-information word condition.
This indicated that activation of necessary information was higher
than that of extraneous information in higher phonological updating
solvers. In lower phonological updating solvers, by contrast,
decisions for necessary- and extraneous-information words might be
equally facilitated, which indicated equal activation of necessary
and extraneous information. Note that both types of information
were highly activated. High phonological updating solvers formed a
problem model that included only necessary information, while low
phonological updating solvers created a problem model that also
included extraneous information. These problem models might
influence planning time. The planning time results showed that the
higher the phonological updating, the shorter the planning time.
Higher phonological updating solvers could form a correct
expression based on a clear problem model, whereas lower
phonological updating solvers had to plan on the basis of a problem
model that also included extraneous information.

The results of Experiment 2 are consistent with a study by
Passolunghi and Pazzaglia (2004), who investigated recall errors
made by low and high updating participants after solving word
problems. Their results indicated that correct recall of necessary
information was higher in the high updating group than in the low
updating group. High updating solvers maintained necessary
information in working memory. According to our findings, these
results might reflect the activation of the information included in
a problem model after the integration process.

The findings in our study revealed that the integration process in
word problem solving depends on the updating function. This is
because problem solvers update their problem model during
comprehension to arrive at an appropriate one.

Updating  or  Working  Memory  Capacity? 

It  is  important  to  consider  whether  the  facilitation  of  integration 
was  an  effect  of  updating  or  simply  WMC.  It  may  be  the  case  that 
an  individual  who  has  a  large  WMC  can  perform  updating  tasks 
efficiently  and  reduce  integration  time  because  he  or  she  can  store 
all  information  in  working  memory  (Daneman  &  Carpenter,  1980). 
If  this  is  the  case,  such  individuals  are  expected  to  integrate  all 
information  into  a  problem  model  easily. 

However, the results of Experiment 2 argue against this
possibility. In the lexical-decision task, the facilitation effect
in lower phonological updating solvers was equal for necessary- and
extraneous-information words. This was inconsistent with a simple
WMC account because the capacity of low phonological updating
individuals was enough to store all information. High updating
solvers, in contrast, updated their problem model. These results
were therefore more likely caused by the updating function than by
WMC.

On the other hand, recent work suggests that WMC reflects the
ability to control attention (e.g., Engle, Tuholski, Laughlin, &
Conway, 1999; Kane, Bleckley, Conway, & Engle, 2001). From this
viewpoint, the present study’s results could have been caused by
WMC as general attentional control rather than by the updating
function. Indeed, the updating function has been shown to be
highly correlated with WMC as measured by the operation span task
(Miyake et al., 2000; St Clair-Thompson & Gathercole, 2006).
Although the updating function is regarded as an underlying process
of WMC, the relationship between the updating function and WMC
remains controversial.

However, our results revealed that constructing an appropriate
problem model does not reflect the ability to keep all information
highly activated but rather the ability to activate only the
necessary information. This view of the updating function seems to
account better for our results.

Origin  of  Updating 

Some studies have focused on the biological mechanisms of
updating. Dahlin, Neely, Larsson, Bäckman, and Nyberg (2008)
demonstrated that two updating tasks (a letter memory task and a
3-back task) produced overlapping activation of the prefrontal
cortex (PFC) and striatum. They also showed that activation in the
striatum increased after 5 weeks of updating task training. This
suggests that the updating function is related to the striatum.
O’Reilly’s prefrontal cortex basal ganglia working memory model
also proposes that updating requires the striatum, which is part of
the basal ganglia (O’Reilly, 2006; O’Reilly & Frank, 2006).
According to this model, the PFC maintains task-relevant information
and the basal ganglia provide selective gating of some PFC regions.
If the state of the striatum is “No-Go,” updating does not occur and
information in working memory is maintained. If the striatum state is
“Go,” information in working memory is updated. Thus, task-relevant
information is stored and updated in working memory.
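
The gating account above can be illustrated with a minimal sketch. It is not O’Reilly and Frank’s computational model; the class name `GatedWorkingMemory`, the slot label, and the example items (borrowed from WP 7 in Appendix B) are assumptions made purely for illustration. A "Go" signal overwrites a working memory slot, whereas a "No-Go" signal leaves its contents untouched.

```python
class GatedWorkingMemory:
    """Toy illustration of selective gating (hypothetical, not the PBWM model)."""

    def __init__(self):
        # Each slot stands in for a PFC region holding one item.
        self.slots = {}

    def present(self, slot, item, gate):
        """Offer a new item to a slot under a striatal gate signal."""
        if gate == "Go":
            # Gate open: the slot is updated with the new item.
            self.slots[slot] = item
        elif gate == "No-Go":
            # Gate closed: the current contents are maintained.
            pass
        else:
            raise ValueError("gate must be 'Go' or 'No-Go'")
        return dict(self.slots)


# Example: the necessary agent is gated in, the extraneous one is gated out.
wm = GatedWorkingMemory()
wm.present("agent", "rabbit", "Go")     # necessary information is stored
wm.present("agent", "sheep", "No-Go")   # extraneous information is excluded
```

In this toy version, a solver whose gate opens for every sentence would end up holding the extraneous item, which loosely mirrors the deficit proposed for low updating solvers.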

This model could account for the results of the present study. In
Experiment 2, high updating led to differential activation of
necessary and extraneous information. Conversely, low updating
led to similar activation of necessary and extraneous information.
According to the prefrontal cortex basal ganglia working memory
model, in a high updating individual, task-relevant information is
maintained and extraneous information is excluded from working
memory via a selective gating mechanism. In contrast, problem
solvers with low updating may have deficits in this gating
mechanism, such that extraneous information might be retained in
working memory. The increased integration time of low updating
solvers could reflect attempts to update more frequently.
These accounts could link studies of word problem solving with
associated biological mechanisms and computational modeling.
These topics need further consideration.

Updating  and  Modality 

Domains of the updating function should also be considered. In
the present study, only phonological updating showed a relationship
with integration time. Visual updating did not show such a
relationship.  The  reason  for  this  could  be  that  processing  word 
problems  requires  verbal  encoding,  which  depends  on  the  phono¬ 
logical  loop.  Consequently,  verbal  information  was  encoded  dur¬ 
ing  integration,  and  only  updating  in  the  phonological  domain 
showed  a  relationship  with  integration.  According  to  this  interpre¬ 
tation,  updating  of  the  visuospatial  sketchpad  did  not  play  an 
important  role  in  solving  word  problems  in  this  study.  However, 
the visuospatial sketchpad has, in some cases, contributed more to
word problem solving than the phonological loop. The contributions
of these components changed with age, even when the problems
were the same (Meyer et al., 2010). Second-grade students showed
contribution  of  the  phonological  loop  and  central  executive,  while 
third-grade  students  demonstrated  contribution  of  the  visuospatial 
sketchpad.  Processing  word  problems  depended  on  the  mental 
model  of  third-grade  students  who  were  skilled  in  translating 
words  into  mental  representations,  which  requires  the  visuospatial 
sketchpad.  These  findings  lead  to  a  possibility  that  adults  are  more 
likely  to  encode  word  problems  as  mental  models.  Nevertheless, 
the  present  study  did  not  show  the  importance  of  visual  updating. 
Extraneous  information  included  in  the  present  study  was  of  a 
verbal  nature.  Problem  solvers  could  make  decisions  about  rele¬ 
vance  based  on  the  verbal  information  given.  The  relationship 
between  phonological  updating  and  integration  was  strongly  dem¬ 
onstrated.  This  suggests  the  possibility  that  problems  demanding 
encoding  of  visual  information  could  show  a  contribution  of  visual 
updating.  Such  problems  (e.g.,  geometric  problems)  should  be 
investigated  to  further  understand  the  relationship  between  the 
updating  function  and  integration. 

Educational  Implications 

Our findings suggest that a weaker updating function leads to
unsuccessful integration. This explains unsuccessful integration
from a cognitive process perspective. Our findings suggest the
possibility that children with lower updating have a problem with
the integration process. These children might improve their
performance in solving word problems if their updating function
could be trained through working memory training. Although the use
of working memory training has recently increased and its
effects have been reported (Au et al., 2015), the training’s
effectiveness in solving arithmetic word problems needs further
investigation. Our findings identified a process in which updating
is important, which could help in developing such training.

Our  findings  also  suggest  that  children  with  lower  updating 
function  will  struggle  with  integration  due  to  the  need  for  updating 
their  problem  model.  In  turn,  this  suggests  a  way  in  which  problem 
solvers with lower updating may be helped. Their performance
could be improved when problem model updating is not required.
That is, if problem solvers know in advance how to construct a
problem model, they might be able to construct one that
includes the relevant information. Indeed, it was reported that
problems  that  use  a  question  as  the  first  sentence  are  easier 
(Robinson  &  Hayes,  1978).  In  this  case,  problem  solvers  could 
activate  their  problem  schema  before  reading  the  problem  and  then 
they  could  integrate  the  information  based  on  the  schema.  This 
would  reduce  the  updating  load.  Therefore,  our  findings  emphasize 
that  when  problem  solvers  have  difficulties  with  their  updating,  it 
is  important  to  design  instruction  to  reduce  updating. 
Appendix A

Examples  of  Word  Problems  in  Experiment  1 

WP  1.  There  are  five  wooden  pencils. 

There  are  2.6  times  as  many  mechanical  pencils  as  wooden  pencils. 

How  many  mechanical  pencils  are  there? 

WP  2.  A  rectangle  is  28  cm  long. 

It  is  26  cm  wide. 

What  is  its  area? 

WP  3.  A  rectangle  is  28  cm  long. 

A  diagonal  of  this  rectangle  is  32  cm. 

This  rectangle  is  15  cm  wide. 

What  is  the  rectangle’s  area? 

WP 4. There are 2.5 l of soy sauce.

There is 1.4 times as much mirin as soy sauce.

How many liters of mirin are there?

WP  5.  There  are  23  apples. 

There  are  3  times  as  many  oranges  as  the  apples. 

There  are  8  times  as  many  grapes  as  apples. 

How  many  oranges  are  there? 

WP  6.  The  distance  to  a  destination  is  18  km. 

I  move  at  6  km  per  hour  on  foot. 

How  many  hours  do  I  take  to  get  to  the  destination? 

WP  7.  There  are  four  dogs. 

There  are  5  more  cats  than  dogs. 

There  are  1.5  times  as  many  birds  as  dogs. 

How  many  birds  are  there? 

WP  8.  The  distance  to  the  destination  is  30  km. 

I  move  at  6  km  per  hour  on  foot. 

I  move  at  15  km  per  hour  by  bicycle. 

How  long  does  it  take  me  to  get  to  the  destination  on  foot? 

Note.  Underlined  sentences  are  extraneous  information.  These  examples  were  translated  into  English  from 
Japanese. 
Appendix B

Examples  of  Word  Problems  in  Experiment  2 


WP 1. The distance to a destination is 48 km.

A  snake  crawls  at  7  km/h. 

How  many  hours  does  the  snake  take  to  get  to  the  destination? 

WP  2.  A  scallop  weighs  120  g. 

A  cod  roe  weighs  20  g  more  than  the  scallop. 

A  soft  seaweed  weighs  30  g  more  than  the  scallop. 

How  much  does  the  cod  roe  weigh? 

WP  3.  There  are  4  hours  to  move. 

A  donkey  does  19  km/h. 

How  far  does  the  donkey  move? 

WP  4.  A  duck  covers  2  km  in  an  hour. 

A  gull  covers  4  km  in  an  hour. 

They  move  for  6  h. 

How  far  does  the  gull  move? 

WP  5.  There  are  9  oranges. 

There  are  3  times  as  many  pears  as  oranges. 

How  many  pears  are  there? 

WP  6.  A  lily  is  40  cm  taller  than  a  dandelion. 

The  dandelion  is  30  cm. 

How  tall  is  the  lily? 

WP  7.  A  distance  to  a  destination  was  17  km. 

A  rabbit  took  8  h  to  get  to  the  destination. 

A  sheep  took  6  h  to  get  to  the  destination. 

How  many  kilometers  an  hour  did  the  rabbit  cover? 

WP  8.  There  are  3  times  as  many  scissors  as  pencils. 

There  are  6  times  as  many  seals  as  pencils. 

There  are  4  pencils. 

How  many  seals  are  there? 

Note.  Underlined  sentences  are  extraneous  information.  A  word  in  necessary  information  was  used  as  a 
necessary-information  word  in  the  lexical  decision  task.  A  word  in  extraneous  information  was  used  as  an 
extraneous-information  word.  For  example,  rabbit  was  a  necessary-information  word  and  sheep  was  an 
extraneous-information  word  in  WP  7. 
Effects of ABRACADABRA Literacy Instruction on Children With Autism
This study explored the effects of ABRACADABRA, a free computer-assisted literacy program, on the
reading  accuracy  and  comprehension  skills  of  children  diagnosed  with  autism  spectrum  disorder  (ASD). 
ABRACADABRA  is  a  balanced  literacy  instruction  program,  targeting  both  code  and  meaning-based 
reading  abilities.  Twenty  children  with  ASD,  aged  5-11  years,  were  assigned  by  matched  pairs  to  the 
instruction  group  or  wait-list  control  group.  Literacy  instruction  was  delivered  on  a  1:1  basis  in 
participants’ homes over a 13-week period (26 sessions per participant). Pre- and postinstruction
assessment using standardized measures revealed statistically significant gains in reading accuracy and
comprehension  for  the  instruction  group  relative  to  the  wait-list  control  group,  with  large  effect  sizes. 
These  findings  indicate  that  children  with  ASD  may  benefit  from  ABRACADABRA  literacy  instruction. 

Keywords:  literacy,  reading  accuracy,  reading  comprehension,  autism,  ASD,  ABRACADABRA 


Early  literacy  skills  provide  a  foundation  for  lifelong  learning. 
Children  who  are  skilled  readers  are  more  likely  to  experience 
positive  academic  outcomes  and  encounter  fewer  emotional  and 
behavioral  difficulties  than  their  reading-delayed  peers  (Willcutt  et 
al.,  2007).  These  children  also  demonstrate  greater  motivation  to 
complete  academic  tasks  (Lyon,  1998)  and  are  less  inclined  to 
leave  school  early,  relative  to  less  skilled  readers  (Daniel  et  al., 
2006).  In  the  longer  term,  skilled  readers  achieve  more  positive 
employment  and  economic  outcomes  (Roman,  2004),  and  exhibit 
greater  health  awareness  than  adults  with  poorer  levels  of  reading 
ability  (DeWalt,  Berkman,  Sheridan,  Lohr,  &  Pignone,  2004). 
Given  the  potential  benefits  of  skilled  reading,  there  is  an  urgent 
need  to  establish  effective  literacy  instruction  for  all  children, 
including  those  with  disabilities  such  as  autism  spectrum  disorder 
(ASD). 

ASD  is  an  early  onset  developmental  disability  characterized  by 
deficits  in  social  communication,  restricted  patterns  of  interests, 
and  engagement  in  repetitive  behaviors  (American  Psychiatric 
Association,  2013).  ASD  is  conceptualized  as  a  spectrum  disorder, 
meaning  that  these  characteristics  manifest  heterogeneously 
throughout the population (Frazier et al., 2012). Global epidemiological
research suggests that the median ASD prevalence rate is
approximately  62/10,000  (Elsabbagh  et  al.,  2012).  ASD  commonly 
co-occurs  with  difficulties  in  the  areas  of  oral  language  (Lord,  Risi, 
&  Pickles,  2004),  cognition  (Chakrabarti  &  Fombonne,  2005),  and 
behavior  (Simonoff  et  al.,  2008).  Those  more  severely  affected  by 
ASD  are  more  likely  to  present  with  associated  comorbidities 
(Leyfer  et  al.,  2006).  The  core  characteristics  of  ASD,  as  well  as 
the  associated  comorbid  difficulties,  can  affect  the  literacy  devel¬ 
opment  of  children  within  this  population. 

Reading  and  ASD 

Reading  is  a  dynamic  process  involving  the  interaction  of  two 
distinct  components:  decoding  of  text  and  comprehension  of 
meaning  (Gough  &  Tunmer,  1986).  For  both  children  without 
disabilities  and  children  with  ASD,  these  component  reading  abil¬ 
ities  draw  heavily  on  underlying  cognitive  and  oral  language  skills 
(Jacobs  &  Richdale,  2013;  Nation  &  Snowling,  2004).  Given  that 
ASD  is  often  associated  with  deficits  in  cognition  and  oral  lan¬ 
guage,  it  follows  that  some  children  with  ASD  are  at  increased  risk 
of  experiencing  reading  difficulties.  Social-communicative  deficits 
and  behavioral  difficulties  may  also  restrict  the  ability  of  some 
children  with  ASD  to  adequately  engage  with  literacy  instruction, 
further  impeding  their  reading  development  (Williams,  Wright, 
Callaghan,  &  Coughlan,  2002). 

In  a  seminal  study  of  reading  and  autism,  Nation  and  colleagues 
(2006)  explored  the  reading  accuracy  and  comprehension  abilities 
of  41  children  diagnosed  with  ASD  (Nation,  Clarke,  Wright,  & 
Williams,  2006).  The  researchers  employed  broad  inclusion  crite¬ 
ria,  requiring  only  that  participants  were  aged  6  to  15  years  and  had 
measurable oral language skills. Their analyses revealed that a
considerable  number  of  children  exhibited  difficulties  in  reading 
accuracy,  with  22%  of  the  participants  completely  unable  to  read 
single  words  and  nonwords.  Data  from  the  remaining  participants 
revealed  an  atypical  profile  of  reading  abilities  characterized  by 
relative  strengths  in  reading  accuracy  and  weaknesses  in  reading 
comprehension. A comparable reading profile has been observed
in  recent  studies  involving  children  with  ASD  that  have  employed 
similar  inclusion  criteria  (e.g.,  Arciuli,  Stevens,  Trembath,  & 
Simpson,  2013). 

A  study  by  Nation  and  Snowling  (1997)  assessed  reading  dif¬ 
ficulties  in  children  without  disabilities.  Their  data  revealed  strong 
correlations  between  the  component  reading  abilities  of  reading 
accuracy  and  comprehension,  and  among  subcomponent  reading 
abilities,  such  as  word  level  reading  accuracy  and  passage  level 
reading  accuracy.  Likewise,  studies  involving  children  with  ASD 
have  reported  significant  correlations  between  component  and 
subcomponent  reading  abilities  (Arciuli  et  al.,  2013;  Nation  et  al., 
2006).  However,  the  correlations  found  in  these  studies  involving 
children  with  ASD  tend  to  be  lower  than  those  reported  in  studies 
of  children  without  disabilities.  These  findings  indicate  that  the 
component  and  subcomponent  reading  abilities  of  children  with 
ASD  may  develop  more  autonomously  by  comparison  with  chil¬ 
dren  who  do  not  have  disabilities. 

An  emerging  body  of  research  addresses  remediation  of  the 
reading  difficulties  exhibited  by  some  children  with  ASD.  In  their 
review, Whalon, Al Otaiba, and Delano (2009) identified 11 studies,
involving a total of 61 participants, which targeted some of the
key  reading-related  abilities  as  defined  by  the  National  Reading 
Panel  (NRP;  National  Institute  of  Child  Health  and  Human  De¬ 
velopment  [NICHD],  2000).  Children  with  ASD  were  shown  to 
benefit  from  instruction  targeting  phonics  (e.g.,  Coleman-Martin, 
Heller,  Cihak,  &  Irvine,  2005),  oral  reading  fluency  (e.g.,  Kamps, 
Barbetta,  Leonard,  &  Delquadri,  1994),  vocabulary  (e.g.,  Kamps, 
Leonard,  Potucek,  &  Garrison-Harrell,  1995),  and  instruction  in 
comprehension  strategies  (e.g.,  Whalon  &  Hanline,  2008).  How¬ 
ever,  few  of  the  reviewed  studies  investigated  whether  these  in¬ 
structional  approaches  promoted  the  development  of  reading  ac¬ 
curacy  and  comprehension  skills  in  children  with  ASD.  In 
addition,  none  of  the  reviewed  studies  evaluated  the  effects  of 
comprehensive  literacy  instruction  that  targets  all  of  the  key 
reading-related  abilities  identified  by  the  NRP.  Given  that  children 
with  ASD  have  been  shown  to  benefit  from  some  individual 
elements  of  literacy  instruction,  the  evaluation  of  more  compre¬ 
hensive  instructional  approaches  is  of  critical  importance. 

ABRACADABRA 

ABRACADABRA  (hereafter  referred  to  as  ABRA;  Centre  for 
the  Study  of  Learning  and  Performance  [CSLP],  2009)  is  a  freely 
available  literacy  program  designed  to  improve  the  reading  and 
writing  skills  of  all  children,  including  those  at  risk  of  low  literacy 
abilities.  ABRA  learning  objectives  are  informed  by  the  recom¬ 
mendations  of  the  NRP  (NICHD,  2000)  and  other  reviews  of 
effective  reading  interventions  (see  Abrami  et  al.,  2010,  for  a 
description  and  explanation  of  the  development  of  ABRA).  Spe¬ 
cifically,  ABRA  targets  the  development  of  foundational  literacy 
skills  including  alphabetics,  reading  fluency,  reading  comprehen¬ 
sion,  and  writing.  Instruction  targeting  these  skills  is  delivered 
using  a  combination  of  computer  activities  and  noncomputerized 
extension  tasks.  According  to  Abrami  et  al.  (2010),  the  pedagog¬ 
ical  underpinnings  of  ABRA  are  intended  to  replicate  those  of 
balanced  literacy  programs,  as  described  by  Chall  (1967)  and 
Adams  (1990).  That  is,  ABRA  learning  activities  emphasize  a 
balance between children’s code (i.e., phonics and word study) and
meaning-based skill development (i.e., reading comprehension),
and  engagement  with  real  literature. 

ABRA  is  the  focus  of  an  ongoing  research  program  at  the  CSLP 
at  Concordia  University.  A  recent  meta-analysis  identified  nine 
randomized controlled trials and quasi-experimental studies that have
examined the effects of ABRA on literacy outcomes as defined by
the NRP (Abrami, Borokhovski, & Lysenko, 2015). These studies
included Kindergarten, Grade 1, and Grade 2 children from diverse
populations. For example, research conducted by Wolgemuth et al.
(2013) included indigenous Australian children, and the study
conducted by Abrami et al. (2014) was undertaken in Sub-Saharan
Africa. There was no mention of children with disabilities.
Across studies, students received between 10 and 32 hours of ABRA
instruction  in  small  groups  or  whole  class  settings  for  periods 
ranging  from  8  to  16  weeks.  Generally,  these  previous  studies 
utilized  standardized  measures  to  assess  outcomes.  The  results  of 
the  meta-analysis  revealed  that  children  who  received  ABRA 
instruction  exhibited  statistically  significant  gains  in  phonemic 
awareness,  phonics,  vocabulary,  and  listening  comprehension 
compared  with  children  in  control  conditions,  with  small  effect 
sizes.  Improvements  in  reading  accuracy,  reading  comprehension, 
and  reading  fluency  were  evident  in  some  previous  studies;  how¬ 
ever,  these  gains  did  not  always  reach  statistical  significance. 
Divergent  findings  across  some  of  the  previous  studies  may  be 
attributed  to  differences  in  the  implementation  of  ABRA  (e.g., 
small  group  vs.  whole  class  administration  of  ABRA;  differences 
in  hours  of  instruction). 

Computer-Assisted  Instruction 

Pedagogical  approaches,  such  as  ABRA,  which  utilize 
computer-assisted  instruction  (CAI)  may  be  well  suited  to  children 
with  ASD  (Grynszpan,  Weiss,  Perez-Diaz,  &  Gal,  2014).  Unlike 
teacher-directed  instruction,  CAI  is  not  heavily  contingent  upon 
social  communicative  abilities,  which  are  a  key  deficit  for  children 
in  this  population  (Williams  et  al.,  2002).  Previous  research  has 
shown  that  children  with  ASD  tend  to  be  more  responsive  during 
CAI  that  targets  social,  language,  and  communication  develop¬ 
ment  as  compared  with  teacher  directed  approaches  (Ploog, 
Scharf,  Nelson,  &  Brooks,  2013). 

A  recent  review  evaluated  the  use  of  CAI  for  the  teaching  of 
reading  and  related  skills  for  children  with  ASD  (Ramdoss  et  al., 
2011).  Twelve  studies  were  included  in  the  review,  involving  a 
total  of  94  participants.  Evidence  supported  the  use  of  CAI  to 
develop  skills  associated  with  reading,  including  phonological 
awareness  (e.g.,  Heimann,  Nelson,  Tjus,  &  Gillberg,  1995),  recep¬ 
tive  language  (e.g.,  Whalen  et  al.,  2010),  vocabulary  (e.g.,  Moore 
&  Calvert,  2000),  and  sentence  construction  (e.g.,  Basil  &  Reyes, 
2003),  as  well  as  component  reading  skills,  decoding  (e.g.,  Tjus, 
Heimann,  &  Nelson,  1998),  and  reading  comprehension  (e.g., 
Basil  &  Reyes,  2003).  On  average,  analyses  revealed  large  effects 
for  these  CAI  programs.  However,  there  was  considerable  vari¬ 
ability  across  studies,  with  some  showing  CAI  to  be  no  more 
beneficial  for  children  with  ASD  than  teacher-led  instruction  (e.g., 
Travers  et  al.,  2011). 

Several  issues  need  to  be  considered  when  evaluating  the  effects 
of  CAI  in  children  with  ASD.  Previous  studies  have  often  involved 
small samples, many comprising fewer than 10 participants in
total. Thus, some studies may have lacked the statistical power to
draw definitive conclusions. In addition, some previous studies
have utilized CAI programs that are difficult to access (e.g.,
Coleman-Martin et al., 2005), costly (e.g., Whalen et al., 2010),
and/or are now considered outdated (e.g., Heimann et al., 1995).
Thus,  the  real-world  applicability  of  many  of  the  CAI  programs 
that  have  been  evaluated  is  questionable.  Finally,  many  previous 
studies  have  relied  on  nonstandardized  measures  of  literacy  to 
evaluate outcomes following CAI (e.g., Whitcomb, Bass, & Luiselli,
2011). The lack of standardized measures in the previous
research  limits  the  generalizability  of  some  of  these  studies. 

The  Current  Study 

In  the  current  study  we  sought  to  explore  the  effects  of  ABRA 
on  the  reading  skills  of  a  diverse  group  of  children  with  ASD. 
Literacy  instruction  was  delivered  using  ABRA’s  freely  available 
web  application  and  noncomputerized  extension  tasks.  Unlike 
previous  ABRA  research,  the  current  study  was  conducted  inde¬ 
pendently  of  the  CSLP. 

The  study  was  guided  by  three  research  questions. 

1.  Can  ABRA  instruction  improve  the  reading  skills  of 
children  with  ASD  when  compared  with  a  control  group 
of  children  with  ASD  who  do  not  receive  ABRA  instruc¬ 
tion? 

2.  Are  improvements  in  reading  ability  following  ABRA 
instruction  observed  across  both  word  and  passage  levels 
for  children  with  ASD? 

3.  How  large  are  the  improvements  in  reading  ability  fol¬ 
lowing  ABRA  instruction  for  children  with  ASD? 

We  hypothesized  that  participants  with  ASD  would  exhibit 
improved  reading  accuracy  and  comprehension  abilities  following 
13  weeks  of  ABRA  instruction  compared  with  a  wait-list  control 
group  of  children  with  ASD.  In  addition,  we  hypothesized  that  the 
relative  gains  achieved  by  participants  with  ASD  following  ABRA 
instruction  would  be  observed  across  three  aspects  of  reading 
ability:  word  level  accuracy,  passage  level  accuracy,  and  passage 
level  comprehension.  However,  we  were  unsure  about  the  size  of 
these  gains. 

Method 

Design 

The  study  followed  a  pretest/posttest  control  group  design. 
Participants  were  assigned  to  one  of  two  experimental  conditions: 
the  wait-list  control  group  or  the  instruction  group.  Pairs  of  par¬ 
ticipants  who  were  of  similar  age  and  had  comparable  oral  lan¬ 
guage,  reading  and  adaptive  abilities  were  identified.  Participants 
in  each  pairing  were  then  randomly  assigned  to  opposing  experi¬ 
mental  conditions  (i.e.,  the  wait-list  control  or  instruction  group). 
These  groupings  were  later  altered  slightly  to  accommodate 
changes  in  participant  availability  as  advised  by  parents  (i.e.,  one 
participant  was  removed  from  the  instruction  group  and  two  par¬ 
ticipants  were  added  in  their  place). 
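
The within-pair randomization described above might be sketched as follows. The participant IDs, the pairs themselves, and the function name `assign_matched_pairs` are hypothetical; the source does not specify how pairs were formed or which randomization procedure was used, so a simple coin flip per pair is assumed.

```python
import random

INSTRUCTION = "instruction"
CONTROL = "wait-list control"


def assign_matched_pairs(pairs, seed=None):
    """Randomly assign the two members of each matched pair to opposing groups.

    pairs: iterable of (participant_a, participant_b) tuples that have
    already been matched on age and baseline abilities.
    """
    rng = random.Random(seed)  # seeded so the allocation can be reproduced
    assignment = {}
    for first, second in pairs:
        # Coin flip decides which member of the pair receives instruction.
        if rng.random() < 0.5:
            first, second = second, first
        assignment[first] = INSTRUCTION
        assignment[second] = CONTROL
    return assignment


# Hypothetical participant IDs, for illustration only.
example_pairs = [("P01", "P02"), ("P03", "P04"), ("P05", "P06")]
groups = assign_matched_pairs(example_pairs, seed=1)
```

Each pair always contributes one child to each condition, so the groups start out balanced on the matching variables regardless of how the coin flips fall.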

Participants  in  the  instruction  group  received  home-based  1:1 
ABRA instruction over a period of 13 weeks (26 sessions per
participant). Participants in the wait-list control group continued
their  normal  academic  schedule  during  this  time.  Thus,  ABRA  was 
supplemental  for  the  instruction  group  while  the  wait-list  control 
group  went  about  their  school  activities  “business  as  usual.”  In¬ 
formation  was  not  collected  regarding  participants’  normal  school 
literacy  instruction.  Pre-  and  postinstruction  assessment  was  car¬ 
ried  out  at  the  University  or  in  the  participant’s  home  within  9  days 
of  the  instruction  period.  All  assessment  and  instruction  sessions 
were  conducted  by  the  first  author  who  is  a  certified  practicing 
speech  pathologist  with  previous  experience  working  with  children 
on  the  autism  spectrum. 

Participants 

Research  advertisements  were  circulated  throughout  speech  pa¬ 
thology  and  psychology  clinics  across  a  large  metropolitan  area 
within  Sydney,  Australia.  The  research  protocol  was  approved  by 
the  relevant  University’s  Human  Research  Ethics  Committee.  Le¬ 
gal  guardians  provided  written  informed  consent  prior  to  partici¬ 
pation. 

Eligibility  for  the  study  required  that  participants  met  the  fol¬ 
lowing  inclusion  criteria:  (a)  5-11  years  of  age;  (b)  previous 
formal  clinical  diagnosis  of  ASD  using  Diagnostic  and  Statistical 
Manual  (DSM)  criteria;  (c)  no  hearing  or  vision  impairments;  (d) 
measurable  language  ability;  and  (e)  able  to  demonstrate  sustained 
attention to tasks for 15 min. Of an initial pool of 25 participants,
two  were  excluded  because  they  did  not  meet  the  inclusion  criteria. 
A  further  three  participants  were  excluded  because  of  conflicts  in 
scheduling.  Twenty  children  formed  the  final  sample,  of  whom  18 
were male. As expected, the final sample was highly heterogeneous
and comprised children with differing levels of developmental,
adaptive and academic functioning. Participants were enrolled
in inclusive education (i.e., classrooms with peers who do
not  have  disabilities),  support  classes  (i.e.,  classrooms  with  peers 
who  have  disabilities  within  a  school  for  students  without  disabil¬ 
ities),  or  specialist  settings  (i.e.,  schools  for  children  with  ASD). 
Demographic  and  diagnostic  information  by  group  (wait-list  con¬ 
trol  vs.  instruction)  is  shown  in  Table  1. 

Independent  samples  t  tests  with  alpha  set  at  .05  showed  no 
statistically  significant  differences  between  the  instruction  and 
wait-list control groups for age, t(18) = -0.354, p = .73, and across
baseline  measures  of  adaptive  ability,  vocabulary,  phonological 

Table 1
Demographic and Diagnostic Information by Group

Characteristic                              Wait-list control   ABRA instruction
Age^a                                       90.22 (19.72)       87.18 (18.65)
Sex (M:F)                                   8:1                 10:1
Reported diagnosis (ASD/Asp./PDD-NOS)       5:1:3               8:1:2
Secondary diagnoses (ADHD/LD/AD)            1:7:3               2:9:8
School (Inclusive/Support/Specialist)       6:3:0               8:1:2
>1 language spoken at home (Y/N)            3:6                 5:6

Note. Asp. = Asperger’s syndrome; PDD-NOS = pervasive developmental
disorder-not otherwise specified; ADHD = attention-deficit/hyperactivity
disorder; LD = language difficulties; AD = articulation
difficulties; Inclusive = inclusive class; Support = support class;
Specialist = specialist class. Data in parentheses are SDs.
^a Age is reported in months.
awareness, word level reading accuracy, passage level reading
accuracy,  and  passage  level  reading  comprehension  (see  Table  2 
for  independent  samples  t  test  results).  The  percentile  rank  mea¬ 
sures  shown  in  Table  2  were  calculated  using  normative  data 
derived  from  samples  that  included  a  majority  of  children  without 
disabilities  (see  the  Measures  section  for  further  details).  Norma¬ 
tive  data  based  solely  on  the  ASD  population  were  not  available 
for  any  of  the  measures. 
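
As a rough check, the age comparison can be recomputed from the summary statistics in Table 1. The sketch below assumes a pooled-variance independent samples t test and infers group sizes of 9 and 11 from the sex ratios in Table 1; the p value is approximated by numerically integrating the t density with Simpson's rule, so all figures are approximate. The function names are illustrative.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Independent samples t statistic (pooled variance) from summary stats."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Approximate two-sided p value via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1 - 2 * central_area)


# Age in months from Table 1: instruction group first, then wait-list control.
# Group sizes (11 and 9) are inferred from the sex ratios, not stated directly.
t_age, df_age = pooled_t(87.18, 18.65, 11, 90.22, 19.72, 9)
p_age = two_sided_p(t_age, df_age)
```

Under these assumptions the statistic comes out at roughly -0.35 on 18 degrees of freedom, with a p value of about .73, which is in line with the reported p value for age.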

As  expected,  scores  for  most  participants  placed  them  well 
below  the  age-adjusted  average  on  the  measure  of  adaptive  ability 
(i.e.,  only  six  participants  achieved  percentile  rankings  above  the 
16th  percentile).  Scores  varied  considerably  within  each  group, 
and  across  each  of  the  measures,  reflecting  the  broad  inclusion 
criteria  utilized  in  the  current  study. 

Measures 

The  measures  used  in  the  current  study  were  selected  for  two 
purposes.  First,  tests  of  oral  language,  reading  and  adaptive  ability 
were  used  to  obtain  baseline  measures  in  order  to  assign  partici¬ 
pants  to  either  the  wait-list  control  or  instruction  group.  Second, 
tests  of  word  and  passage  level  reading  accuracy  and  comprehen¬ 
sion  were  used  to  evaluate  outcomes  following  ABRA  instruction. 
All  of  the  standardized  tests  included  in  the  protocol  are  widely 
used,  valid  and  reliable  measures.  Each  provides  age  or  year-of- 
schooling  referenced  percentile  ranks.  With  the  exception  of  adap¬ 
tive  ability,  all  assessments  were  administered  individually  to  each 
participant  by  the  first  author.  The  measure  of  adaptive  ability  was 
obtained  individually  via  semistructured  parent  interview  with  the 
first  author.  Participants  received  a  score  of  zero  if  unable  to 
satisfy  basal  level  performance  criteria  on  a  test. 

Where  known,  we  report  the  percentage  of  children  with  ASD  in 
the  normative  sample  associated  with  each  assessment.  Some 
normative  samples  included  children  with  ASD  but  considered 
these  children  as  belonging  to  broader  disability  classifications. 
For  these  samples,  the  percentage  of  children  in  autism-related 
classifications  is  reported.  For  the  remaining  normative  samples, 
the  percentage  of  children  with  disabilities  is  reported. 

Adaptive  ability.  Each  parent  participated  in  a  semistructured 
interview  using  the  Survey  Interview  Form  from  the  Vineland 
Adaptive Behavior Scales-2nd Edition (VABS-2; Sparrow, Cicchetti,
& Balla, 2005). The test evaluated the domains of communication
(receptive, expressive, written), daily living skills (personal,
domestic, community), and socialization (interpersonal
relationships,  play  and  leisure  time,  coping  skills).  Additional 
items  measuring  fine  and  gross  motor  skills  were  administered  to 
parents  of  participants  aged  six  years  and  younger  (n  =  10). 
Children  with  health  impairments,  traumatic  brain  injury,  multiple 
impairments,  and/or  autism  comprised  1.7%  of  the  VABS-2  nor¬ 
mative  sample.  Thus,  most  of  the  children  included  in  the  norma¬ 
tive  sample  did  not  have  disabilities.  For  the  current  sample,  the 
VABS-2  was  found  to  have  a  high  level  of  internal  consistency  for 
children  aged  seven  years  and  older  (Cronbach’s  alpha  =  .97),  and 
children  aged  six  years  and  younger  (Cronbach’s  alpha  =  .99). 
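
Internal consistency is reported throughout this section as Cronbach's alpha. As a minimal sketch of how such a coefficient is typically computed (the function name and the example scores are hypothetical and are not taken from the study), alpha can be obtained from the item variances and the variance of the total scores:

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items table of scores.

    item_scores: list of rows, one row per respondent, one column per item.
    """
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_variances = [
        variance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical item scores (4 respondents x 2 items), not data from the study.
example = [[1, 2], [2, 1], [3, 3], [4, 4]]
print(round(cronbach_alpha(example), 2))  # 0.89
```

Values close to 1, such as those reported for the current sample, indicate that the items of a test vary together closely.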

Vocabulary.  The  Peabody  Picture  Vocabulary  Test-4th  edi¬ 
tion  (PPVT-4;  Dunn  &  Dunn,  2007)  is  a  test  of  receptive  vocab¬ 
ulary.  Using  Form  A,  participants  were  instructed  to  select  one  of 
four  images  best  illustrating  a  target  word  verbally  presented  by 
the  researcher.  The  PPVT-4  includes  many  simple  items  to  im¬ 
prove  measurement  of  lower  functioning  and  younger  children. 
Children  with  ASD  constituted  0.2%  of  the  PPVT-4  normative 
sample.  The  vast  majority  of  the  remaining  children  did  not  have 
disabilities.  For  the  current  sample,  the  PPVT-4  was  found  to  have 
a  high  level  of  internal  consistency  (Cronbach’s  alpha  =  .98). 

Phonological awareness. The Phonological Awareness Composite
Score (PACS) from the Comprehensive Test of Phonological
Processing-2nd Edition (CTOPP-2; Wagner, Torgesen,
Rashotte, & Pearson, 2013) was used to assess phonological
awareness. The PACS comprises three related subtests: (a)
Elision,  which  is  a  sound  deletion  task  that  measures  ability  to 
segment  and  manipulate  sounds  within  words;  (b)  Blending 
Words,  where  participants  listened  to  a  series  of  audio-recorded 
sounds  and  were  then  required  to  blend  these  sounds  together  to 
form  a  whole  word;  and  (c)  Sound  Matching  (for  participants  aged 
five  to  six  years  only),  in  which  participants  identified  one  picture 
from  a  choice  of  three  that  began  with  the  same  sound  as  a  word 
read  by  the  researcher.  Participants  aged  7  to  11  years  (n  =  10) 
completed  a  Phoneme  Isolation  task,  where  they  were  instructed  to 
identify  the  phoneme  occupying  a  specified  position  in  a  target 
word.  Children  with  learning  or  health  impairments  constituted 
approximately 5% of the CTOPP-2 normative sample; the remainder
of the sample did not have disabilities. For the current sample,
the CTOPP-2 was shown to have a high level of internal consistency
for children aged six years and younger (Cronbach’s alpha = .97),
and children aged seven years and older (Cronbach’s alpha = .98).

Table 2

Mean Age-Based Percentile Rank for Each Preinstruction Measure by Group

                                   Wait-list control (n = 9)    ABRA instruction (n = 11)
Measure                            M       SD      Range        M       SD      Range        t(18)   p      Cohen’s d
Adaptive ability                   17.56   25.74   1-84         18.36   19.93   2-63         .08     .94    .04
Vocabulary                         29.14   30.83   .3-79        26.00   19.45   1-53         .29     .78    .12
Phonological awareness             16.11   17.93   0-39         15.09   20.04   0-63         .12     .91    .05
Word level reading accuracy        38.89   35.44   2-87         43.27   31.61   2-98         .29     .77    .13
Passage level reading accuracyᵃ    19.89   25.90   0-65         25.45   26.77   0-81         .47     .64    .21
Reading comprehensionᵃ             16.67   26.98   0-68         15.55   17.28   0-53         .11     .91    .05

Note. Adaptive ability: Vineland Adaptive Behavior Scales (VABS-2), Adaptive Behavior Composite; Vocabulary: Peabody Picture Vocabulary Test
(PPVT-4); Phonological awareness: Comprehensive Test of Phonological Processing (CTOPP-2), Phonological Awareness Composite Score; Word level
reading accuracy: Wide Range Achievement Test (WRAT-4), Word Identification subtest; Passage level reading accuracy and reading comprehension:
Neale Analysis of Reading Ability (NARA-3).
ᵃ Data for passage level reading accuracy and reading comprehension are year-of-schooling based percentile ranks.
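
The group comparisons in Table 2 can be checked from the reported means, standard deviations, and group sizes alone. The sketch below assumes an independent-samples t test with a pooled standard deviation and Cohen's d computed from that same pooled value; the article does not state which formulas were used, and the function names are illustrative. Under these assumptions the calculation reproduces the tabled values for adaptive ability and phonological awareness.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Absolute t statistic and Cohen's d from summary statistics."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    diff = abs(m1 - m2)
    t = diff / (sp * sqrt(1 / n1 + 1 / n2))
    d = diff / sp
    return t, d


# Adaptive ability row of Table 2 (wait-list control n = 9, ABRA n = 11).
t, d = t_and_d(17.56, 25.74, 9, 18.36, 19.93, 11)
print(round(t, 2), round(d, 2))  # 0.08 0.04

# Phonological awareness row of Table 2.
t, d = t_and_d(16.11, 17.93, 9, 15.09, 20.04, 11)
print(round(t, 2), round(d, 2))  # 0.12 0.05
```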

Word  level  reading  accuracy.  The  Word  Reading  subtest  of 
the Wide Range Achievement Test-4th Edition (WRAT-4;
Wilkinson  &  Robertson,  2006)  was  used  to  measure  participants’ 
ability  to  accurately  decode  letters  and  words.  Participants  were 
directed  to  read  aloud  a  list  of  individual  letters  followed  by  a  list 
of  real  words.  Word  reading  targets  were  arranged  in  order  of  least 
(e.g.,  “cat”)  to  most  difficult  (e.g.,  “usurp”).  Children  with  disabil¬ 
ities  constituted  approximately  5%  of  the  WRAT-4  normative 
sample.  The  remaining  children  in  the  normative  sample  did  not 
have  disabilities.  For  the  current  sample,  the  WRAT-4  was  found 
to  have  a  high  level  of  internal  consistency  (Cronbach’s  alpha  = 
.95). 

Passage  level  reading  accuracy.  The  Reading  Accuracy 
Composite  Score  from  the  Neale  Analysis  of  Reading  Ability-3rd 
edition  (NARA-3;  Neale,  1999)  was  used  to  assess  participants’ 
ability  to  accurately  decode  passage  level  text.  This  assessment 
required  participants  to  read  aloud  a  series  of  passages  of  increas¬ 
ing  length  and  complexity.  The  NARA-3  manual  does  not  report 
the  number  of  children  with  ASD,  or  other  disabilities,  in  its 
normative  sample.  For  the  current  sample,  the  reading  accuracy 
composite  was  found  to  have  high  internal  consistency  (Cron¬ 
bach’s  alpha  =  .82). 

Passage  level  reading  comprehension.  The  Reading  Com¬ 
prehension  Composite  Score  from  the  NARA-3  (Neale,  1999)  was 
used  to  assess  participants’  ability  to  derive  meaning  from  written 
text  at  the  passage  level.  This  involved  participants  reading  a  series 
of  passages  aloud  before  being  asked  a  number  of  prescribed 
questions  related  to  the  text.  For  the  current  sample,  the  reading 
comprehension  composite  was  shown  to  have  high  internal  con¬ 
sistency  (Cronbach’s  alpha  =  .95). 

Procedure 

Preinstruction  assessment.  Participants  completed  a  stan¬ 
dardized  assessment  battery  of  reading,  oral  language  and  adaptive 
abilities (see Measures section for a description of these
assessments). The battery was administered to confirm that the
groups were equivalent before one group received instruction.
Assessment tasks were administered in the order in
which  they  appear  in  the  preceding  section.  Assessment  sessions 
ranged  from  60-  to  90-min  duration,  depending  on  the  abilities  and 
behaviors  of  individual  participants. 

ABRACADABRA  instruction.  ABRA  was  implemented  as 
per  the  standard  recommended  protocol  with  the  exception  of  two 
purposeful  adaptations,  which  were  discussed  with  and  approved 
by  the  CSLP.  First,  as  a  consequence  of  the  1:1  setting  used  in  the 
current  study,  ABRA  instruction  sessions  did  not  include  collab¬ 
orative  work  with  child  peers.  Instead,  additional  time  was  as¬ 
signed  to  the  computer  activities  and  a  reward  task  at  the  end  of  the 
session,  and  participants  worked  collaboratively  with  the  first 
author  during  the  ABRA  extension  tasks  (e.g.,  taking  turns  reading 
pages  of  a  story).  Second,  in  anticipation  that  some  children  with 
ASD  would  perform  less  consistently  than  children  without  dis¬ 
abilities,  the  criterion  used  to  identify  skill  mastery  was  lowered 
slightly  from  90%  to  85%  accuracy  (further  details  regarding  skill 
mastery  are  provided  below). 


ABRA activities targeted four key literacy abilities: (a) alphabetics,
(b) reading fluency, (c) reading comprehension, and (d) writing
(Table 3). Word level activities used to promote alphabetics (i.e.,
the ability to associate sounds with letters and use these
sounds  to  create  words)  were  presented  in  a  hierarchical  sequence. 
The  sequence  began  with  early  developing  skills  (e.g.,  sound 
matching)  and  ended  with  more  complex  tasks  (e.g.,  word  seg¬ 
mentation  and  blending).  Word  attack  skills  targeted  during  the 
word  level  computer  tasks  were  incorporated  into  passage  level 
reading  fluency  and  comprehension  tasks.  For  example,  partici¬ 
pants  could  click  on  unfamiliar  words  in  the  passage  level  predic¬ 
tion  task  and  observe  them  being  decoded.  Writing  tasks  required 
participants  to  type  word  and  passage  level  targets  on  a  computer 
to  dictation.  Within  most  activities,  skill  development  and  task 
autonomy  were  targeted  using  a  system  of  least  (e.g.,  encouraging 
independent  decoding)  to  most  (e.g.,  demonstrated  decoding) 
prompts.  Reward  contingencies  (e.g.,  shots  in  a  hockey-themed 
comprehension  game)  were  used  to  encourage  ongoing  participant 
motivation  and  engagement. 

ABRA’s balanced curriculum and graded learning tasks permitted
highly individualized literacy instruction. The preinstruction
assessment data were used to inform the researchers of
each  participant’s  profile  of  literacy  abilities  (i.e.,  relative 
strengths  and  weaknesses).  These  profiles  were  used  in  con¬ 
junction  with  the  ABRA  manual  to  identify  learning  objectives, 
tasks,  and  task  difficulty  settings  appropriate  for  instruction. 
Learning  objectives,  tasks,  and  associated  task  difficulty  set¬ 
tings  were  reviewed  following  each  instruction  session  using 
ongoing  measures  of  participant  performance.  A  performance 
criterion  of  65%-85%  accuracy  was  employed  to  identify  tasks 
of  appropriate  content  and  difficulty  for  instruction.  Skill  mas¬ 
tery  was  set  at  85%  accuracy  for  each  independent  task,  main¬ 
tained  over  three  consecutive  sessions. 
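
The two accuracy criteria described above can be expressed as simple rules. The following sketch is illustrative only: the function names are hypothetical, accuracy is expressed as a proportion, and the treatment of the boundary values (exactly 65% or 85%) is an assumption, because the text does not specify it.

```python
def in_instructional_range(accuracy, low=0.65, high=0.85):
    """True if a task falls in the 65%-85% accuracy band used to select
    tasks of appropriate difficulty (boundaries assumed inclusive)."""
    return low <= accuracy <= high


def has_mastered(session_accuracies, threshold=0.85, run_length=3):
    """True if accuracy stayed at or above the threshold for the required
    number of consecutive sessions (85% over three sessions here)."""
    run = 0
    for accuracy in session_accuracies:
        # Extend the current run, or reset it when a session falls short.
        run = run + 1 if accuracy >= threshold else 0
        if run >= run_length:
            return True
    return False


# Hypothetical per-session accuracies for one task, not data from the study.
print(has_mastered([0.70, 0.86, 0.90, 0.85]))        # True
print(has_mastered([0.90, 0.90, 0.60, 0.90, 0.90]))  # False
```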

Instruction  consisted  of  two  60-min  training  sessions  delivered 
weekly  over  a  13-week  period  working  1:1  with  participants.  Instruc¬ 
tion  sessions  were  conducted  outside  of  school  hours,  and  therefore 
necessarily  in  participants’  homes,  to  minimize  disruption  to  school 
activities.  Computer  activities  were  presented  on  a  15.6”  laptop  with 
participants  seated  one  meter  from  the  screen  at  eye  level.  These 
activities  were  designed  to  encourage  independent  participation  (e.g., 
animated  videos  demonstrating  task  completion  appeared  prior  to 
each  activity).  However,  the  experimenter  was  present  for  the  dura¬ 
tion  of  each  session  and  assisted  participants  in  transitioning  between 
tasks.  All  participants  had  at  least  some  ability  to  independently 
operate  a  standard  computer  mouse  prior  to  commencing  ABRA 
instruction.  Some  participants  received  additional  support,  in  the  form 
of  hand-over-hand  assistance,  to  operate  the  hardware  during  tasks 
which  required  rapid  responses.  Breaks  were  provided  to  participants 
as  required  throughout  the  instruction  sessions. 

Each  60-min  ABRA  session  followed  a  routine  structure.  First, 
participants  completed  a  15-min  computer  task  targeting  word  level 
abilities  (i.e.,  alphabetics,  high-frequency  word  identification,  or  word 
spelling).  Next,  participants  completed  a  20-min  computer  task  tar¬ 
geting  passage  level  abilities  (i.e.,  reading  fluency,  reading  compre¬ 
hension,  or  sentence  spelling).  Skills  targeted  during  the  computer 
activities  were  then  revisited  during  a  15-min,  noncomputerized  ex¬ 
tension  task  which  involved  interaction  between  the  experimenter  and 
participant (e.g., shared reading or spelling games). Consistent with
previous ABRA research, these extension tasks were guided by the
recommendations of the ABRA manual. At the end of each session,
participants were rewarded with a 10-min free choice activity (e.g.,
Legos).

Table 3

ABRACADABRA Activities

Literacy domain          Task level      Task name                     Task description
Alphabetics              Word level      Matching sounds               Identify matching sounds
                         Word level      Alphabet song                 Sing along to the alphabet song
                         Word level      Word counting                 Count words in an audio-recording
                         Word level      Syllable counting             Count syllables in an audio-recording
                         Word level      Same word                     Identify same vs. different words
                         Word level      Same phoneme                  Identify same vs. different phonemes
                         Word level      Word matching                 Identify words with same vs. different initial or final phoneme
                         Word level      Animated alphabet             Watch animations featuring letter sounds, a letter-writing cue and an alliterative phrase for each letter of the alphabet
                         Word level      Letter sound search           Identify letters corresponding to audio-recorded phonemes
                         Word level      Letter identification bingo   Identify letters by name
                         Word level      Rhyme matching                Identify pairs of rhyming words
                         Word level      Word families                 Substitute initial letter(s) to form a new word (e.g., map → mat → bat)
                         Word level      Auditory blending             Match phonemically segmented word to image
                         Word level      Auditory segmenting           Match audio-recording of full word to segmented version of target word
                         Word level      Blending train                Identify target word following phonemically segmented audio-recording
                         Word level      Basic decoding                Decode written word and match to corresponding image
                         Word level      Word changing                 Substitute letters to form a new word (e.g., rat → mat → map)
Reading fluency          Word level      High frequency words          Identify a list of high frequency words
                         Passage level   Tracking                      Scan passage level text from left to right
                         Passage level   Expression                    Identify audio-recording as being read with good vs. bad expression then read the same passage aloud with appropriate expression
                         Passage level   Accuracy                      Read passage of text without error
                         Passage level   Speed                         Read passage of text at appropriate pace
Reading comprehension    Passage level   Prediction                    Predict future events during passage level narrative
                         Passage level   Comprehension monitoring      Identify words incorrectly substituted in the text
                         Passage level   Sequencing                    Place story images in linear order following reading
                         Passage level   Summarizing                   Respond to questions during passage level reading task (questions designed to highlight important plot elements)
                         Passage level   Vocabulary                    Select sentences containing correct use of a target word
                         Passage level   Vocabulary (ESL)              Match audio-recorded words to corresponding images. Participants then included target words in a cloze passage.
                         Passage level   Story response                Respond verbally to questions following reading
                         Passage level   Story elements                Respond to multiple choice questions following reading
Writing                  Word level      Spelling words                Words typed to dictation
                         Passage level   Spelling sentences            Sentences typed to dictation
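
The session structure and overall dosage described in the Procedure can be verified with simple arithmetic. The sketch below only restates figures given in the text (segment lengths, two sessions per week, 13 weeks); the variable names are illustrative.

```python
# Segments of one ABRA session, in minutes, as described in the Procedure.
SESSION_SEGMENTS = [
    ("word level computer task", 15),
    ("passage level computer task", 20),
    ("noncomputerized extension task", 15),
    ("free choice reward activity", 10),
]
SESSIONS_PER_WEEK = 2
PROGRAM_WEEKS = 13

session_minutes = sum(minutes for _, minutes in SESSION_SEGMENTS)
total_sessions = SESSIONS_PER_WEEK * PROGRAM_WEEKS
total_hours = total_sessions * session_minutes / 60

print(session_minutes, total_sessions, total_hours)  # 60 26 26.0
```

The four segments account for the full 60-min session, and 26 sessions of 60 min correspond to the 26 hr of instruction referred to under compliance fidelity.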

Postinstruction  assessment.  The  postinstruction  assessment 
included  three  outcome  measures:  (a)  word  level  reading  accuracy,  (b) 
passage  level  reading  accuracy,  and  (c)  passage  level  reading  com¬ 
prehension. 

Implementation  Fidelity 

Implementation  fidelity  was  addressed  across  three  levels:  context, 
compliance,  and  competence  fidelity  (Fixsen,  Naoom,  Blase,  Fried¬ 


man,  &  Wallace,  2005).  Context  fidelity  requires  that  the  precursors 
necessary  for  effective  instruction  are  in  place  prior  to  a  program’s 
implementation.  In  the  current  study,  the  first  author  ensured  high 
context  fidelity  prior  to  beginning  instruction  by  gaining  access  to  the 
ABRA  learning  materials  and  the  ABRA  Learning  Tool  Kit  Teacher 
Guide  (hereafter  referred  to  as  the  ABRA  manual;  Abrami,  White,  & 
Wade,  2010),  and  by  completing  ABRA  administration  training. 
ABRA administration training comprised two sessions conducted by
a representative from the CSLP. During these training sessions, the
first author and CSLP representative discussed the theoretical,
developmental and pedagogical underpinnings of ABRA. The CSLP
representative also presented information relevant to the administration
of ABRA in the current study (e.g., the modifications necessary to
accommodate 1:1 instruction) and demonstrated the use of the program.
Before and between the training sessions, the first author was
required to complete readings, including previous ABRA research, the
ABRA manual (Abrami, White, & Wade, 2010) and the teacher’s
zone on the CSLP website, and become proficient in the use of the
ABRA software.

Table 4

Mean Raw Scores Pre- and Postinstruction for Each Outcome Measure by Group

                                          Wait-list control (n = 9)    ABRA instruction (n = 11)
Measure                                   M       SD      Range        M       SD      Range
Preinstruction
  Word level reading accuracy             25.33   12.29   8-45         25.82   10.80   8-43
  Passage level reading accuracy          20.11   23.60   0-64         20.09   18.64   0-45
  Passage level reading comprehension     5.67    8.50    0-22         5.00    5.33    0-15
Postinstruction
  Word level reading accuracy             24.89   12.24   1-M          28.64   10.21   13-43
  Passage level reading accuracy          19.44   22.75   0-65         25.82   21.29   1-58
  Passage level reading comprehension     5.89    8.94    0-25         8.64    7.89    1-25

Note. Word level reading accuracy: Wide Range Achievement Test (WRAT-4), Word Identification subtest;
Passage level reading accuracy and reading comprehension: Neale Analysis of Reading Ability (NARA-3).

Compliance  fidelity  refers  to  the  degree  to  which  the  core  elements 
of  a  program  are  utilized  during  its  implementation.  Consistent  with 
the  recommendations  of  the  CSLP  (as  per  the  ABRA  manual  and 
ABRA  administration  training),  instruction  sessions  were  individually 
planned  to  include  computer  and  noncomputerized  learning  tasks 
targeting  a  balance  of  code  and  meaning-based  learning  objectives. 
Each  participant’s  progression  through  these  learning  objectives  was 
documented  to  ensure  that  all  children  completed  the  prescribed  26  hr 
of  instruction  and  that  learning  objectives  complied  with  the  recom¬ 
mendations  of  the  CSLP  (e.g.,  performance  criterion  of  85%  accuracy 
was  used  to  identify  skill  mastery).  In  these  ways,  written  documents 
composed  during  the  instruction  period  (i.e.,  session  plans  and  session 
notes)  show  that  the  core  elements  of  ABRA  instruction,  including 
instructional  content  and  duration,  were  implemented  in  the  current 
study. 

Competence  fidelity  is  the  level  of  skill  with  which  the  core 
elements  of  a  program  are  delivered  during  its  implementation.  The 
standardized  nature  of  the  ABRA  computer  activities  goes  some  way 
toward  ensuring  competence  fidelity  in  the  current  study.  For  exam¬ 
ple,  preprogrammed  video  models  that  are  embedded  within  the 
ABRA  computer  program  itself  ensured  that  participants  received  an 
appropriate  introduction  to  each  computerized  ABRA  activity.  How¬ 
ever,  external  measures  relating  to  the  first  author’s  implementation  of 
ABRA  were  not  collected.  As  such,  competence  fidelity  cannot  be 
independently  verified  in  the  current  study. 

Results 

Raw  scores  for  each  of  the  outcome  measures  are  provided  in 
Table  4.  As  can  be  seen,  children  in  the  wait-list  control  group 
maintained  or  showed  slight  decreases  in  their  raw  scores  over 
time.  By  contrast,  children  in  the  instruction  group  showed  in¬ 
creases  in  their  raw  scores. 

As the children within each group were of different ages and grades,
we converted raw scores to percentile ranks. The effects of ABRA
instruction on participants’ reading performance were evaluated using
a series of 2 × 2 analyses of variance (ANOVAs; Time × Group)
with α = .05. The dependent variable used in these analyses was
either age-based percentile rank (for word level reading accuracy,
where percentile rank is calculated from age in months) or
year-of-schooling referenced percentile rank (for passage level
reading accuracy and comprehension, where percentile ranks are
calculated from grade). ANOVAs conducted using participants’ raw
scores are also reported.
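
For a design with two groups and two time points, the Time × Group interaction F is algebraically equal to the square of an independent-samples t statistic computed on gain scores (postinstruction minus preinstruction), with 1 and N − 2 degrees of freedom. The sketch below illustrates this equivalence; the function name and the scores are hypothetical and are not data from the study.

```python
from statistics import mean, variance


def interaction_f_from_gains(pre_a, post_a, pre_b, post_b):
    """Time x Group interaction F for a two-group pre/post design,
    computed as the squared pooled-variance t test on gain scores.

    Returns (F, error degrees of freedom); the effect has 1 df.
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    df_error = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(gains_a) + (n_b - 1) * variance(gains_b)
    ) / df_error
    standard_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    t = (mean(gains_b) - mean(gains_a)) / standard_error
    return t ** 2, df_error


# Hypothetical scores: group A barely changes, group B gains about 3 points.
f_value, df_error = interaction_f_from_gains(
    pre_a=[10, 12, 14], post_a=[10, 13, 13],
    pre_b=[10, 12, 14], post_b=[13, 16, 16],
)
print(round(f_value, 2), df_error)  # 13.5 4
```

With 9 and 11 participants per group, the error degrees of freedom are 18, matching the F(1, 18) values reported below.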

Word  Level  Reading  Accuracy 

A statistically significant interaction effect was observed for
Time × Group on the word level reading accuracy measure, F(1,
18) = 5.73, p < .05, with a large effect size, ηp² = .24.¹ As shown
in Figure 1, scores for participants in the instruction group
increased from pre- to postinstruction assessment, suggesting
improved word level reading ability. By contrast, scores for the
wait-list control group decreased between these two time points.²
Analysis of raw scores revealed a similar result, Time × Group
interaction: F(1, 18) = 12.50, p < .01, ηp² = .41.

Passage  Level  Reading  Accuracy 

Analysis of the passage level reading accuracy data revealed a
statistically significant Time × Group interaction, F(1, 18) =
10.50, p < .01, with a large effect size, ηp² = .37. Figure 2 shows
an increase in mean percentile rank for the instruction group,
suggesting an improvement in passage level reading accuracy,
while there was relatively little change in the reading scores of
the wait-list control group. Analysis of raw scores showed a
similar result, Time × Group interaction: F(1, 18) = 12.38, p <
.01, ηp² = .41.

¹ ηp² of .01 is considered to be a small effect size, .06 a medium effect
size, and .14 a large effect size (Richardson, 2011).

² Note that the wait-list control group achieved very similar raw scores
on the WRAT-4 at pre- and postinstruction assessment (25.33 vs. 24.89).
The slight decrease in percentile rank (shown in Figure 1) is likely because
of the particular norming method used in the WRAT-4 (i.e., norms are
based on age in months). That is, for the wait-list control group,
participants’ raw scores at postinstruction assessment corresponded to slightly
lower percentile rankings because these participants were not making the
kind of progress that would be expected with increasing age as was seen in
the normative sample (largely comprised of individuals without disabilities).

Figure 1. Mean percentile rankings for word level reading accuracy
(Wide Range Achievement Test-4th Edition [WRAT-4]) by group.

Figure 3. Mean percentile rankings for reading comprehension (Neale
Analysis of Reading Ability-3rd edition [NARA-3]) by group.

Passage  Level  Reading  Comprehension 

A statistically significant Time × Group interaction was found
for the measure of passage level reading comprehension, F(1,
18) = 10.59, p < .01, with a large effect size, ηp² = .37. As shown
in Figure 3, percentile rank scores for the instruction group
increased from pre- to postinstruction assessment, suggesting an
improvement in passage level reading comprehension. Scores for
the wait-list control group were relatively consistent across the
two time points, indicating little change in reading comprehension
skills. Analysis of participants’ raw scores again revealed
a similar result, Time × Group interaction: F(1, 18) = 8.51,
p < .01, ηp² = .32.
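
The partial eta squared values reported for the three outcome measures follow directly from the F ratios and their degrees of freedom, using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the six reported F(1, 18) values; the function and dictionary names are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Time x Group interaction F(1, 18) values reported in the Results.
reported_f = {
    "word level accuracy, percentile ranks": 5.73,
    "word level accuracy, raw scores": 12.50,
    "passage level accuracy, percentile ranks": 10.50,
    "passage level accuracy, raw scores": 12.38,
    "reading comprehension, percentile ranks": 10.59,
    "reading comprehension, raw scores": 8.51,
}

for label, f_value in reported_f.items():
    print(label, round(partial_eta_squared(f_value, 1, 18), 2))
```

The computed values (.24, .41, .37, .41, .37, and .32) agree with those reported, and all exceed the .14 benchmark for a large effect.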


Nonparametric  Analyses 

In view of the modest sample size, the effects of ABRA instruction
were also evaluated using nonparametric Mann-Whitney tests
conducted on pre-/postinstruction percentile rank difference scores
for each of the outcome measures. The median difference score for
the word level reading accuracy measure was 3 for the instruction
group and −1 for the wait-list control group. For the passage level
reading accuracy measure, the median difference score was 13 for
the instruction group and 0 for the wait-list control group. For the
passage level reading comprehension measure, the median difference
score was 14 for the instruction group and 0 for the wait-list
control group. With alpha set at .05, analyses showed statistically
significant gains for the instruction group relative to the wait-list
control group across all three reading measures: word level reading
accuracy (U = 21.00, z = −2.17, p < .05), passage level
reading accuracy (U = 10.50, z = −2.99, p < .01), and passage
level reading comprehension (U = 9.50, z = −3.07, p < .01).

Figure 2. Mean percentile rankings for passage level reading accuracy
(Neale Analysis of Reading Ability-3rd edition [NARA-3]) by group.
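
The Mann-Whitney tests reported in the nonparametric analyses can be sketched as follows. The U statistic is computed from midranks of the pooled difference scores, and z uses the normal approximation without a correction for ties. The function names and the example scores are hypothetical. Without a tie correction the reported U values give z = −2.17, −2.96, and −3.04; the slightly larger reported magnitudes for the two passage level measures (−2.99 and −3.07) would be consistent with a tie correction, although the article does not say whether one was applied.

```python
from math import sqrt


def midranks(values):
    """Ranks from 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Smaller of the two Mann-Whitney U statistics for samples x and y."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)


def z_from_u(u, n1, n2):
    """Normal approximation to U, without a tie correction."""
    expected = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - expected) / sd


# Hypothetical difference scores, not data from the study.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0
# z for the reported word level U with group sizes 9 and 11.
print(round(z_from_u(21.0, 9, 11), 2))       # -2.17
```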


Discussion 

In  the  current  study  we  examined  the  effects  of  ABRACA¬ 
DABRA  literacy  instruction  on  the  reading  abilities  of  a  diverse 
group  of  children  with  ASD.  Our  research  was  guided  by  three 
questions  intended  to  ascertain  (a)  whether  ABRA  instruction 
could  be  used  to  facilitate  reading  development  in  children  with 
ASD;  (b)  whether  the  gains  achieved  using  ABRA  would  be 
observed  across  both  word  and  passage  level  reading  abilities;  and 
(c)  the  size  of  these  gains. 

We  hypothesized  that  participants  in  the  ABRA  instruction 
group  would  exhibit  improved  reading  abilities  compared  with  a 
wait-list  control  group.  Consistent  with  this  hypothesis,  partici¬ 
pants  in  the  instruction  group,  relative  to  the  wait-list  control 
group,  achieved  statistically  significant  gains  in  reading  accuracy 
and  comprehension  following  26  sessions  of  ABRA  instruction 
administered  over  a  13-week  period.  Our  second  hypothesis  was 
that  the  relative  gains  achieved  by  participants  in  the  instruction 
group  would  be  observed  across  both  word  and  passage  level 
reading abilities. The data revealed statistically significant gains
for the instruction group, compared with the wait-list control
group,  across  all  three  aspects  of  reading  ability  that  were  assessed 
(i.e.,  word  level  reading  accuracy,  passage  level  reading  accuracy, 
and  passage  level  reading  comprehension),  thus  confirming  our 
hypothesis. 

With  regard  to  our  third  hypothesis,  effect  size  calculations 
showed  considerable  gains  for  the  instruction  group  relative  to  the 
wait-list  control  group  across  each  of  the  evaluated  aspects  of 
reading.  Gains  achieved  in  the  word  level  reading  accuracy  skills 
of  the  instruction  group  compared  to  the  wait-list  control  group 
were large (ηp² = .24), suggesting that ABRA instruction was
effective in facilitating substantial improvements in the word level
reading abilities of children with ASD. By comparison with the
wait-list control group, the instruction group also achieved large
gains in their passage level reading accuracy (ηp² = .37) and
passage level reading comprehension (ηp² = .37). It is interesting
that  gains  in  the  instruction  group  were  more  pronounced  with 
regard  to  participants’  passage  level  reading  skills  as  compared 
with  their  word  level  reading  skills  (as  revealed  by  the  ANOVAs 
that  were  conducted  on  percentile  ranks — gains  were  equivalent  in 
the  ANOVAs  that  were  conducted  on  raw  scores).  This  uneven 
pattern  of  improvement  may  be  further  evidence  that  subcompo¬ 
nent  reading  skills  can  develop  more  autonomously  in  children 
with  ASD  as  compared  with  children  without  disabilities  (Arciuli 
et  al.,  2013;  Nation  et  al.,  2006). 

Previous  ABRA  Research 

The  results  presented  here  are  in  line  with  several  studies 
showing  that  ABRA  can  have  positive  effects  on  children’s  read¬ 
ing  abilities.  Indeed,  the  effect  sizes  reported  in  the  current  study 
compare  favorably  with  the  previous  ABRA  research.  For  exam¬ 
ple,  in  contrast  to  the  large  effects  reported  in  the  current  study, 
Wolgemuth et al. (2013) found ABRA instruction to have a
statistically significant, medium-sized effect (d = .36) on the
combined word reading accuracy and phonological awareness skills of
children  without  disabilities.  Other  previous  research  has  identified 
a  modest  average  effect  size  (g+  =  0.065)  for  ABRA  on  the 
reading  comprehension  skills  of  children  without  disabilities 
(Abrami  et  al.,  2015).  Such  comparisons  suggest  that  children  with 
ASD  may  be  more  receptive  to  ABRA  instruction  relative  to 
children  without  disabilities.  However,  there  are  important  differ¬ 
ences  in  the  way  the  ABRA  literacy  instruction  was  delivered  in 
the  current  study  versus  previous  studies.  For  instance,  the  current 
study  evaluated  the  effects  of  ABRA  instruction  administered  on  a 
1:1 basis whereas previous research has focused exclusively on the
effects  of  ABRA  instruction  in  small  groups  or  whole  class  set¬ 
tings.  Thus,  comparison  of  effect  sizes  obtained  from  the  current 
study  and  previous  research  should  be  carefully  considered. 
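
One further reason for caution is that the effect sizes being compared are on different metrics (ηp² in the current study, d and g+ in previous research). As a rough illustration only, ηp² can be converted to Cohen's f and then to an approximate d. The conversion d = 2f is an assumption that holds for a between-subjects comparison of two equal-sized groups; the effects in the current study are interactions from a mixed design with groups of 9 and 11, so the resulting values should be read as approximate orders of magnitude rather than as equivalents.

```python
from math import sqrt


def eta_squared_to_f(eta_squared):
    """Cohen's f from (partial) eta squared."""
    return sqrt(eta_squared / (1 - eta_squared))


def approximate_d(eta_squared):
    """Approximate Cohen's d, assuming two equal-sized groups (d = 2f)."""
    return 2 * eta_squared_to_f(eta_squared)


# Partial eta squared values reported for the percentile rank analyses.
for eta_squared in (0.24, 0.37):
    print(eta_squared, round(approximate_d(eta_squared), 2))
```

Under these assumptions, ηp² values of .24 and .37 correspond to d of roughly 1.1 and 1.5.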

Numerous  features  within  the  ABRA  program  could  potentially 
benefit  children  with  ASD.  Broadly  speaking,  it  is  posited  that 
these  features  could  contribute  to  the  effectiveness  of  ABRA  via 
children’s  improved  engagement  with  instructional  content,  in¬ 
creased  access  to  learning  opportunities,  and  enhanced  generaliza¬ 
tion  of  learned  skills  across  instructional  contexts. 

Engagement.  ABRA  may  serve  to  enhance  the  willingness 
and  ability  of  children  with  ASD  to  engage  with  instructional 
content. ABRA sessions follow a set structure, which would appear
well-suited to the needs of children with ASD in that they are
commonly found to show a preference for repetition and predictability
(Richler, Huerta, Bishop, & Lord, 2010). The interactive
interface  of  ABRA’s  computer  tasks  is  considered  beneficial  in 
that  it  requires  children  to  actively  engage  with  and  respond  to 
instruction.  Active  cognitive  processing,  such  as  that  facilitated  by 
ABRA,  is  critical  to  learning  (Wouters,  Paas,  &  van  Merrienboer, 
2008).  ABRA  activities  also  occur  within  the  context  of  an  over¬ 
arching  storyline.  Embedding  learning  activities  within  a  broader 
narrative  in  this  way  may  enhance  intrinsic  motivation  for  learning 
(Baranowski, Buday, Thompson, & Baranowski, 2008), and assist
in  the  creation  of  an  immersive  learning  environment  (Dickey, 
2006). Therefore, the features of ABRA may help to reduce the
difficulties  some  children  with  ASD  have  in  engaging  with  in¬ 
structional  content. 

Accessibility.  ABRA  instruction  may  promote  the  ability  of 
children  with  ASD  to  access  valuable  literacy  learning  opportuni¬ 
ties  in  several  ways.  First,  learning  objectives  and  task  difficulty 
settings  are  tailored  to  ensure  that  each  individual  child  com¬ 
mences  instruction  at  an  appropriate  level  and  experiences  the  high 
rates  of  accurate  responding  necessary  for  efficient  learning  (La¬ 
mella  &  Tincani,  2012).  Second,  key  reading  skills  and  their 
associated  learning  tasks  are  introduced  via  animated  video.  This 
permits  the  use  of  visually  cued  instructions,  the  likes  of  which 
have  been  shown  to  benefit  children  with  ASD  (Quill,  1997). 
Third,  many  ABRA  activities  provide  structured  feedback  using  a 
system  of  least-to-most  prompts.  This  form  of  feedback  appears 
well-suited  to  children  with  ASD,  many  of  whom  prefer  routine 
and  may  be  averse  to  unpredictable  feedback  (Hume,  Plavnick,  & 
Odom,  2012).  Considered  collectively,  these  features  are  proposed 
to  function  in  such  a  way  as  to  assist  children  with  ASD  to  access 
ABRA’s  instructional  content  despite  their  often  considerable 
social-communicative,  cognitive  and  behavioral  difficulties. 

Generalization. Our pre- and postinstruction testing utilized
standardized  assessments  that  were  created  independently  of 
ABRA.  Results  revealed  improvements  for  the  instruction  group 
relative  to  the  wait-list  control  group.  Thus,  ABRA  instruction 
appeared  to  generalize  to  a  broader  set  of  reading  materials. 
ABRA’s  multimodal  instructional  approach  may  encourage  the 
generalization  of  learnt  reading  skills  for  children  with  ASD  in  two 
ways.  First,  discrete  reading  skills,  which  are  initially  taught  in 
isolation,  are  explicitly  integrated  into  passage  level  reading  tasks 
involving  both  decoding  and  reading  comprehension.  This  form  of 
embedded  instruction  may  serve  to  enhance  both  the  development 
of  discrete  skills  and  the  abilities  of  children  with  ASD  to  inde¬ 
pendently  apply  these  skills  during  novel  tasks  (Smith,  Spooner,  & 
Wood,  2013).  Second,  ABRA  sessions  are  structured  in  such  a  way 
as  to  ensure  that  reading  skills  are  targeted  using  both  computer 
and  noncomputerized  learning  tasks.  The  use  of  multiple  mediums 
is  proposed  to  aid  in  the  development  of  generalized  reading  skills 
in  children  with  ASD,  many  of  whom  are  shown  to  have  difficulty 
generalizing  learned  skills  across  instructional  contexts  (Hume, 
Loftin,  &  Lantz,  2009). 

Previous  CAI  Research 

The  current  study  addressed  some  of  the  limitations  in  the 
previous  research  on  CAI  and  ASD.  These  limitations  include  the 
use  of  small  samples  typically  comprised  of  higher  functioning 
children,  reliance  on  nonstandardized  outcome  measures,  and  use 
of CAI programs that are inaccessible, expensive, or outdated. We
addressed  these  limitations  by  evaluating  the  effects  of  a  freely 
accessible,  computer-assisted  program  on  the  reading  skills  of  a 
relatively  large,  diverse  sample  of  children  with  ASD  using  stan¬ 
dardized  outcome  measures.  The  inclusion  of  standardized  out¬ 
come  measures  in  the  current  study  permitted  us  to  directly  com¬ 
pare  participants  of  different  ages  and  to  quantify  changes  in 
reading  ability  for  each  participant  with  ASD  from  pre-  to  postin¬ 
struction  with  reference  to  a  normative  sample. 

Previous  research  has  returned  mixed  results  regarding  the  ef¬ 
fects  of  CAI  on  the  reading  skills  of  children  with  ASD.  For 
example,  Williams  et  al.  (2002)  found  nonsignificant  gains  in  word 
reading  accuracy  for  children  with  ASD  who  received  instruction 
using  an  experimental  computer-based  literacy  program.  By  con¬ 
trast,  Tjus  et  al.  (1998)  reported  significant  improvements  in  the 
word  and  sentence  reading  accuracy  skills  of  children  with  ASD 
following  instruction  using  the  Delta  Messages  program,  with 
large effects (g_av = 1.031). Basil and Reyes (2003) also identified
improvements  in  reading  comprehension  for  a  child  with  ASD 
following  instruction  using  the  Delta  Messages  program  but  did 
not  report  effect  sizes. 

It  is  important  to  note  that  the  types  of  CAI  investigated  in 
previous  research  have  differed  widely  in  both  instructional  focus 
and  mode  of  delivery.  For  example,  where  the  current  study 
utilized  a  balanced  reading  program  (i.e.,  targeting  both  code  and 
meaning  based  abilities)  delivered  using  a  web  application  and 
noncomputerized  extension  tasks,  Tjus  et  al.  (1998)  administered  a 
purely  computerized  intervention  targeting  only  sentence  construc¬ 
tion  skills.  The  instruction  protocols  employed  across  these  studies 
have  also  differed  in  intensity  and  duration,  ranging  from  a  few 
days  (e.g.,  Moore  &  Calvert,  2000)  to  several  months  (e.g., 
Bosseler  &  Massaro,  2003),  and  have  involved  divergent  samples 
of  children  with  ASD,  differing  widely  in  both  age  and  level  of 
functioning.  Given  these  inconsistencies,  it  is  difficult  to  directly 
compare  the  learning  outcomes  of  children  with  ASD  following 
exposure  to  the  various  CAI  programs.  However,  the  large  effects 
reported  in  the  current  study  suggest  that  ABRA  may  be  among  the 
more  effective  CAI  programs  for  teaching  reading  skills  to  chil¬ 
dren  with  ASD. 

Limitations  and  Future  Research 

While  the  findings  reported  in  the  current  study  are  encouraging, 
several  limitations  warrant  consideration.  First,  we  did  not  collect 
information  regarding  the  regular  classroom  literacy  instruction 
that  participants  received  during  the  instruction  period.  As  a  con¬ 
sequence,  it  is  not  possible  to  determine  whether  differences  in 
classroom  literacy  instruction  may  have  contributed  to  our  results. 
However,  given  that  participants  in  each  group  came  from  a 
number  of  different  districts,  we  think  it  unlikely  that  classroom 
instruction  could  have  had  a  systematic  effect  on  the  results. 
Second,  assessment  and  instruction  sessions  were  conducted  by  the 
first  author.  As  such,  it  is  possible  that  increased  rapport  may  have 
affected  the  performance  of  participants  in  the  instruction  group  at 
postinstruction assessment. However, we emphasize that pre- and
postinstruction assessment utilized standardized measures with
strict  administration  procedures,  thereby  limiting  the  effect  of 
rapport.  Third,  external  measures  of  competence  fidelity  were  not 
collected. It is therefore unclear whether the first author implemented
ABRA with a high degree of skill. However, there was
strong  evidence  of  context  and  compliance  fidelity. 

An  evaluation  of  ABRA  that  addresses  the  above  limitations 
with  a  larger  sample  of  children  with  ASD  is  encouraged.  A  larger 
study  would  benefit  from  incorporating  additional  outcome  mea¬ 
sures  (e.g.,  those  relating  to  nonword  decoding  skills)  and  could 
explore  the  effects  of  ABRA  on  different  subgroups  within  the 
ASD  population,  such  as  children  with  and  without  comorbid 
language  difficulties.  Future  studies  of  children  with  ASD  could 
also  evaluate  classroom-based  or  parent-directed  administration  of 
ABRA  as  well  as  the  use  of  ABRA  as  a  core,  as  opposed  to 
supplemental,  literacy  program. 

Conclusion 

The  current  study  is  the  first  to  evaluate  the  effects  of  ABRA 
instruction  on  children  with  ASD  and,  as  far  as  we  are  aware,  is  the 
first  investigation  of  ABRA  to  be  conducted  independently  of  the 
CSLP.  Our  findings  demonstrate  that  children  with  ASD,  like 
children  without  disabilities,  can  benefit  from  balanced  literacy 
instruction  that  targets  alphabetics,  reading  fluency,  reading  com¬ 
prehension,  and  writing  contained  within  the  ABRA  instruction 
program.  The  benefit  we  report  here  was  observed  across  three 
aspects  of  reading  ability:  word  level  accuracy,  passage  level 
accuracy,  and  passage  level  reading  comprehension.  In  short,  the 
freely  available,  computer-assisted  ABRA  program  shows  great 
promise  in  improving  reading  outcomes  for  children  with  ASD. 

This article reports on a meta-analysis of 120 studies (total N = 52,578; 782 effects) examining the
relationship between creativity and academic achievement in research conducted since the 1960s.
The average correlation between creativity and academic achievement was r = .22, 95% CI [.19, .24]. An
analysis  of  moderators  revealed  that  this  relationship  was  constant  across  time  but  stronger  when 
creativity  was  measured  using  creativity  tests  compared  to  self-report  measures  and  when  academic 
achievement  was  measured  using  standardized  tests  rather  than  grade  point  average.  Moreover,  verbal 
[...] tests. Theoretical and practical consequences are discussed.

Is there a relationship between creativity and academic achieve¬
ment?  This  is  a  longstanding  and  largely  unresolved  question.  For 
more  than  half  a  century,  educators  and  psychologists  have  at¬ 
tempted  to  address  this  issue  (Cline,  Richards,  &  Abe,  1962; 
Mednick,  1963).  At  a  conceptual  level,  scholars  have  asserted  that 
creativity  and  learning  represent  interrelated  phenomena  (e.g., 
Beghetto,  2016a;  Guilford,  1967;  Piaget,  1962,  1981;  Sawyer, 
2012;  Vygotsky,  1967/2004).  Some  of  the  earliest  and  most  prom¬ 
inent  theorists  in  the  field  have  noted  this  link.  Guilford  (1967),  for 
instance,  asserted  that  creativity  and  learning  are  essentially  the 
same  phenomenon.  Vygotsky  (1967/2004)  similarly  argued  that 
the  creative  imagination  “is  a  completely  essential  condition  for 
almost  all  human  mental  activity”  (p.  17).  Another  example  is 
Piaget’s  theory  of  genetic  epistemology.  Indeed,  creativity  is  cen¬ 
tral  to  Piaget’s  theory  of  learning.  As  Gruber  (in  Bringuier,  1980) 
has  explained  in  reference  to  Piaget’s  theory,  “The  child  does  not 
learn  simply  what  the  adult  tells  him,  he  reinvents.  It’s  a  kind  of 
creativity”  (p.  67). 

Regardless  of  the  theoretical  stance  one  takes  on  learning — 
be  it  behavioral,  cognitive,  constructivist,  situated,  sociomate¬ 
rial,  or  some  other  theoretical  orientation — creativity  and  learn¬ 
ing  share  fundamental  similarities.  Indeed,  both  creativity  and 
learning  involve  change.  More  specifically,  creativity  refers  to 
new  and  meaningful  changes  in  thoughts,  products,  and  actions 
(Beghetto, 2016a; Sternberg, 1999). Similarly, learning represents
relatively stable changes in understanding and behavior
(Alexander,  Schallert,  &  Reynolds,  2009).  Moreover,  both 
learning  and  creativity  can  be  viewed  as  processes  and  products 
(e.g.,  Alexander  et  al.,  2009;  Beghetto,  2016a;  Donovan  & 
Bransford,  2005;  Mumford,  Medeiros,  &  Partlow,  2012;  Wal¬ 
las,  1926).  It  is  therefore  possible  to  say  that  “a  creative  act  [as 
a  product]  is  an  instance  of  learning  [as  a  process],  for  it 
represents  a  change  in  behavior”  (Guilford,  1950,  p.  446). 
Along  these  same  lines,  it  is  also  possible  to  say  learning  (as  a 
product)  is  a  creative  process,  because  it  results  from  new  and 
personally  meaningful  changes  in  one’s  prior  understanding 
(Beghetto,  2016a).  Given  the  theoretical  links  between  creativ¬ 
ity  and  learning,  it  seems  reasonable  to  assume  that  there  would 
be  a  positive  relationship  between  creativity  and  measures  of 
academic  achievement.  The  empirical  work  that  has  examined 
this  link,  however,  has  yielded  a  more  equivocal  picture.  Some 
researchers  have,  for  instance,  reported  positive  associations 
ranging from .10 to .56 (Cicirelli, 1967; Getzels & Jackson, 1962;
Niaz,  Nunez,  &  Pineda,  2000;  Ohnmacht,  1966).  Others  have 
reported  little  or  no  association  (e.g.,  Edwards  &  Tyler,  1965; 
Grigorenko  et  al.,  2009).  Still  others  have  reported  negative 
associations  (e.g.,  Anderson,  White,  &  Stevens,  1969).  In  fact, 
some researchers have noted all three patterns within the same
study  (e.g.,  Gralewski  &  Karwowski,  2012).  Consequently,  the 
best  that  can  be  said  about  whether  there  is  a  link  between 
creativity  and  academic  achievement  is  this:  It  depends. 

Why  might  this  be  the  case?  The  present  meta-analysis  endeav¬ 
ors  to  address  this  question.  More  specifically,  we  have  two 
primary  aims  for  our  study.  Our  first  goal  is  to  provide  an  average 
effect  size  of  the  relationship  between  creativity  and  academic 
achievement.  Our  second  goal  is  to  examine  the  potential  impact  of 
factors  that  may  moderate  the  relationship  between  creativity  and 
academic  achievement.  Although  there  are  examples  of  meta- 
analytic  studies  that  have  addressed  related  issues  (e.g.,  the  rela¬ 
tionship  between  creativity  and  intelligence;  see  Kim,  2005),  we 
are  not  aware  of  any  published  meta-analytic  studies  of  creativity 
and academic achievement.¹ We therefore endeavor to shed light
on  the  mixed  findings  of  prior  research  by  providing  a  more  stable 
estimate  of  the  relationship  between  creativity  and  academic 
achievement  and  by  examining  factors  that  may  potentially  mod¬ 
erate  this  relationship. 

Creativity and Academic Achievement

Creativity

Creativity  scholars  generally  agree  that  creativity  represents  a 
combination of originality (novelty or newness) and usefulness
(meeting task constraints or meaningfulness) as defined within
a  particular  sociocultural  and  historical  context  (Amabile,  1996; 
Kaufman  &  Beghetto,  2009;  Plucker,  Beghetto,  &  Dow,  2004; 
Simonton,  2012;  Sternberg  &  Lubart,  1999).  The  following  sim¬ 
plified  notation  captures  this  definition  (adapted  from  Beghetto  & 
Kaufman,  2014;  Simonton,  2012): 

$$C = \underbrace{O \times TC}_{\text{context}}$$

In  the  above  notation,  C  refers  to  creativity,  O  refers  to  origi¬ 
nality,  and  TC  refers  to  task  constraints.  As  specified  by  this 
formulation,  creativity  is  a  multiplicative  combination  of  original¬ 
ity  and  task  constraints  as  situated  within  a  particular  context. 
Consequently, something that is original (O = 1) but does not meet
contextually  defined  task  constraints  (TC  =  0)  could  be  called 
original  but  not  creative  (C  =  0).  Consider,  for  instance,  a  student 
taking  a  calculus  exam  who  produces  a  vivid  and  quite  stunning 
pencil  drawing  of  mathematical  symbols  transforming  into  doves 
(instead  of  solving  the  problem  presented  on  the  exam).  Such  a 
response  is  clearly  original,  but  it  would  not  be  considered  creative 
in  the  context  of  the  exam.  In  order  for  a  student’s  response  on  a 
calculus  exam  to  be  considered  creative,  it  would  need  to  represent 
a  novel  solution  to  the  problem  at  hand  (i.e.,  meet  the  task 
constraints). 
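
The multiplicative logic of this definition can be illustrated with a minimal sketch. The 0-to-1 scoring scale and the function name are assumptions made for illustration; the text itself only uses the values 0 and 1, and the notation is conceptual rather than a scoring procedure.

```python
def creativity(originality: float, task_constraints: float) -> float:
    """Return C = O x TC, with both inputs scored on a hypothetical 0-1 scale."""
    if not (0.0 <= originality <= 1.0 and 0.0 <= task_constraints <= 1.0):
        raise ValueError("scores must lie between 0 and 1")
    return originality * task_constraints


# The calculus-exam example: a striking drawing instead of a solution.
drawing_of_doves = creativity(originality=1.0, task_constraints=0.0)

# A novel solution that also solves the problem presented on the exam.
novel_solution = creativity(originality=1.0, task_constraints=1.0)

# A correct but entirely routine solution.
routine_solution = creativity(originality=0.0, task_constraints=1.0)

print(drawing_of_doves, novel_solution, routine_solution)
```

Because the two terms are multiplied rather than added, a zero on either component yields zero creativity, which is why the drawing counts as original but not creative in the context of the exam.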

In  the  context  of  academic  learning,  creativity  can  be  thought  of 
as  occurring  at  both  a  subjective  (creativity  as  part  of  the  act  of 
learning)  and  an  intersubjective  (learning  as  a  creative  act)  level 
(Beghetto,  2016a).  At  the  subjective  level,  students  exercise  their 
creativity  by  developing  new  and  personally  meaningful  ideas, 
insights,  and  understandings  within  the  context  of  particular  aca¬ 
demic  constraints  (Beghetto,  2007;  Beghetto  &  Kaufman,  2007). 
At  the  intersubjective  level,  students  who  share  their  unique  and 
academically  accurate  insights  and  interpretations  can  also  con¬ 
tribute  to  the  learning  and  understanding  of  others  (Beghetto, 
2016a). 

In  this  way,  creativity  is  more  than  originality  (Beghetto,  2010), 
divergent  thinking  (Baer,  1993;  Beghetto,  2013;  Guilford,  1967; 
Runco,  1991),  or  vividness  of  imagination  (Dziedziewicz  &  Kar- 
wowski,  2015;  Jankowska  &  Karwowski,  2015).  It  also  involves 
deductive  and  inductive  thinking  (Dunbar,  1997;  Vartanian,  Mar- 
tindale,  &  Kwiatkowski,  2003;  Weisberg,  2006),  as  well  as  the 
ability  to  use  specific  problem-solving  strategies  to  generate  novel 
solutions  to  complex  and  ill-defined  problems  (Beghetto,  2016b; 
Finke,  Ward,  &  Smith,  1992;  Sternberg,  1998).  All  these  charac¬ 
teristics  are  important  for  the  acquisition  of  new  knowledge  and 
learning  (Greiff  et  al.,  2013).  In  this  way,  creativity  and  learning 
work hand-in-hand (e.g., Beghetto, 2016a; Guilford, 1967; Piaget,
1981; Vygotsky, 1967/2004). It therefore seems reasonable to
suggest  that  creativity  would  be  related  to  academic  achievement, 
which  is  conceptualized  as  the  outcome  of  learning. 

Academic  Achievement 

Academic  achievement  is  an  outcome  of  learning,  which  is 
typically  measured  by  classroom  grades,  classroom  assessments, 
and  external  achievement  tests.  Researchers  who  have  examined 
correlates  of  academic  achievement  have  identified  a  wide  array  of 
factors,  including  individual,  social,  and  sociocultural  influences 
(see  Hattie,  2009,  for  a  review).  Of  these,  student  characteristics 
play  one  of  the  broadest  and  most  influential  roles  in  explaining 
variations  in  academic  achievement.  Student  characteristics  repre¬ 
sent  a  highly  heterogeneous  dimension,  which  includes  personality 
(Chamorro-Premuzic & Furnham, 2003; Poropat, 2009), cognitive
abilities (e.g., Chamorro-Premuzic & Furnham, 2008; Deary,
Strand,  Smith,  &  Fernandes,  2007),  intensity  and  type  of  motiva¬ 
tion  (Di  Domenico  &  Fournier,  2015),  self-esteem  and  academic 
self-concept  (Marsh  &  Hau,  2004),  and  socioeconomic  factors 
(Johnson, McGue, & Iacono, 2007; Sackett, Kuncel, Arneson,
Cooper,  &  Waters,  2009). 

Creativity  is  yet  another  student  characteristic  that  shares  a 
conceptual,  albeit  equivocal,  link  with  academic  achievement.  As 
we  have  discussed,  researchers  have  reported  associations  that  are 
relatively strong (e.g., r = .41, Marjoribanks, 1976, or r = .66,
Yeh,  2004),  modest  (e.g.,  r  =  .20,  McCabe,  1991),  null  (e.g.,  r  = 
.03,  Tatlah,  Aslam,  Ali,  &  Iqbal,  2012),  and,  in  some  cases, 
negative  (e.g.,  r  =  -.03,  Anderson  et  al.,  1969).  The  aim  of  the 
present  study  is  to  help  clarify  the  empirical  ambiguity  surround¬ 
ing  the  link  between  creativity  and  academic  achievement  by 
providing a stable estimate of the association and by examining
whether and how potential moderators might influence that association.
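
As an illustration of what a pooled estimate of such correlations involves, the following minimal sketch averages correlations after a Fisher z transformation, weighting each study by n - 3. This is a basic fixed-effect approach shown only for illustration and is not necessarily the model used in the present meta-analysis. The correlations are those cited above; the sample sizes are hypothetical, because the text does not report them.

```python
import math


def pooled_correlation(studies):
    """Pool (r, n) pairs with a fixed-effect Fisher z average.

    Returns the pooled r and an approximate 95% confidence interval.
    """
    zs = [math.atanh(r) for r, _ in studies]   # Fisher r-to-z transform
    weights = [n - 3 for _, n in studies]      # inverse of var(z) = 1 / (n - 3)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = 1 / math.sqrt(sum(weights))
    lower, upper = z_bar - 1.96 * se, z_bar + 1.96 * se
    # Back-transform from the z metric to the correlation metric.
    return math.tanh(z_bar), (math.tanh(lower), math.tanh(upper))


# Correlations cited in the text; every sample size here is hypothetical.
example_studies = [(0.41, 100), (0.66, 100), (0.20, 100), (0.03, 100), (-0.03, 100)]
r_pooled, ci = pooled_correlation(example_studies)
print(f"pooled r = {r_pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Averaging in the z metric rather than averaging raw correlations avoids the bias that arises because the sampling distribution of r is skewed when the population correlation is not zero.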

Potential  Moderators 

What  might  account  for  variations  in  the  relationship  between 
creativity  and  academic  achievement?  Researchers  who  have  ad¬ 
dressed  this  question  (e.g.,  Freund  &  Holling,  2008;  Gralewski  & 
Karwowski,  2012;  Vijetha  &  Jangaiah,  2010)  have  identified  sev¬ 
eral  moderating  factors  (see  Figure  1).  As  illustrated  in  Figure  1, 
those  factors  include  (a)  the  type  of  measurement  used,  (b)  grade 
level  of  participants,  (c)  the  decade  the  study  was  conducted,  and 
(d)  the  geographic  region  of  the  study.  In  the  sections  that  follow, 
we  briefly  describe  each  of  these  potential  moderators. 
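
A moderator of this kind can be examined, in its simplest form, by pooling the correlations separately within each level of the moderator and testing whether the pooled estimates differ. The sketch below does this for two subgroups using Fisher z values and a z test on their difference. It is a simplified fixed-effect illustration, not the analysis reported in this article, and all correlations, sample sizes, and names in it are hypothetical.

```python
import math


def pool_fisher_z(studies):
    """Return the weighted mean Fisher z and the total weight for (r, n) pairs."""
    weights = [n - 3 for _, n in studies]
    zs = [math.atanh(r) for r, _ in studies]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return z_bar, sum(weights)


def subgroup_difference(group_a, group_b):
    """Test whether two subgroups differ in their pooled Fisher z."""
    z_a, w_a = pool_fisher_z(group_a)
    z_b, w_b = pool_fisher_z(group_b)
    diff = z_a - z_b
    se = math.sqrt(1 / w_a + 1 / w_b)          # SE of the difference
    z_stat = diff / se
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return diff, z_stat, p_value


# Hypothetical studies grouped by type of creativity measure.
creativity_tests = [(0.30, 150), (0.25, 200)]
self_reports = [(0.10, 150), (0.15, 200)]
diff, z_stat, p_value = subgroup_difference(creativity_tests, self_reports)
print(f"difference in z = {diff:.2f}, z = {z_stat:.2f}, p = {p_value:.4f}")
```

With more than two levels of a moderator, or with several effects nested within studies, a between-groups heterogeneity test or a meta-regression would be the more appropriate tool.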

Type  of  Measurement 

The type of measurement represents one of the most clearly
identifiable moderators of the empirical relationship between creativity
and academic achievement. Put simply, regardless of the
conceptual overlap between creativity and academic achievement,
the degree of the observed relationship between creativity and
academic achievement will, in large part, be determined by the
amount of overlap in how each construct is measured. Moreover,
there is a wide array of methods and measures that can be and have been
used to measure both constructs (Freund & Holling, 2008;
Gralewski & Karwowski, 2012). With respect to creativity, this
includes everything from self-report measures to more objective
creativity tests. To further compound this issue, there is little
consensus in the field on how best to measure creativity (see
Freund & Holling, 2008; Kaufman, Plucker, & Baer, 2008).

Figure 1. Potential factors influencing the creativity and achievement relationship.

1 An anonymous reviewer brought to our attention an unpublished report
(Halliburton-Beatty & Simms, 2013) that reanalyzed meta-analytic data
testing the impact of creativity training programs on school achievement
(presented in the meta-synthesis by Hattie, 2009). This report, however,
has a different scope and focus than our present study. As previously
described, our analysis focuses on the relationship between creative ability/
self-concepts and academic achievement (rather than the impact of creative
training programs), and we analyze effects reported in primary source
material (rather than a reanalysis).

The  kinds  of  creativity  measures  typically  used  to  examine  the 
relationship  between  creativity  and  achievement  can  be  classified 
into  two  types:  self-report  and  more  objective  creativity  tests. 
Self-report  measures  tend  to  focus  on  beliefs  about  one’s  creative 
ability  (e.g.,  Karwowski  &  Lebuda,  2015;  Skager,  Klein,  & 
Schultz,  1967),  creative  activity  or  achievement  (mainly  invento¬ 
ries  measuring  the  intensity  of  declared  creative  behaviors  and 
activities  or  observable  creative  accomplishments,  e.g.,  Carson, 
Peterson,  &  Higgins,  2005;  Jauk,  Benedek,  &  Neubauer,  2014), 
and  indicators  of  creative  personality  (e.g.,  Naderi,  Abdullah, 
Aizan,  Sharir,  &  Kumar,  2009).  More  objective  creativity  tests 
tend  to  focus  on  divergent  thinking  skills  (i.e.,  the  ability  to 
generate  original  ideas).  These  include  tests  based  on  Guilford’s 
theory  (e.g.,  Toll,  1985),  the  Test  of  Creative  Thinking-Drawing 
Production  (TCT-DP)  by  Urban  and  Jellen  (Urban,  1991;  Kar¬ 
wowski  &  Gralewski,  2013),  the  Torrance  Test  of  Creative  Think¬ 
ing  (TTCT;  Clapham,  2004;  Torrance,  1968),  and  other  instru¬ 
ments  (e.g.,  the  Remote  Associates  Test,  Mednick,  1963,  or  the 
Sternberg Triarchic Abilities Test, Chooi, Long, & Thompson,
2014).

Creativity  tests  can  be  further  distinguished  by  modality:  verbal 
tests  (i.e.,  requiring  participants  to  provide  verbal  answers  to  the 
problems  provided;  e.g.,  the  TTCT  verbal,  Hansenne  &  Legrand, 
2012, or the Verbaler Kreativitäts-Test, Rindermann & Neubauer,
2004)  and  figural  tests  (i.e.,  requiring  participants  to  draw  the 
solution;  e.g.,  the  TCT-DP,  Gralewski  &  Karwowski,  2012,  or  the 
Test of Creative Imagery Ability, Jankowska & Karwowski,
2015). The most popular divergent thinking tests (e.g., TTCT) can
be further divided into dimensions of divergent thinking (i.e.,
fluency,  flexibility,  originality,  or  elaboration).  Previous  studies 
have  demonstrated  that  aspects  of  divergent  thinking  vary  in  their 
association  with  academic  achievement  (e.g.,  Auzmendi,  Villa,  & 
Abedi,  1996;  Feldhusen,  Treffinger,  Van  Mondfrans,  &  Ferris, 
1971).  We  therefore  explore  whether  these  different  dimensions 
influence  the  relationship  between  creativity  and  academic 
achievement  but,  given  the  limited  work  in  this  area,  have  no 
prediction  as  to  the  specific  strength  of  this  influence  (e.g.,  non¬ 
existent,  weak,  moderate,  strong). 

With  respect  to  academic  achievement,  researchers  have  also 
used  a  wide  array  of  methods  and  measures  to  examine  the 
relationship  with  creativity.  Similar  to  creativity  measures,  aca¬ 
demic  achievement  measures  can  be  classified  into  two  types: 
subjective  assessments  and  objective  tests.  Grade  point  averages 
(GPAs)  represent  the  most  common  type  of  subjective  measure 
used  in  studies  that  have  examined  the  link  with  creativity  (e.g., 
Chamorro-Premuzic,  2006;  Freund  &  Holling,  2008;  Gralewski  & 
Karwowski,  2012).  More  objective  tests  refer  to  any  externally 
constructed  tests  of  academic  subject  matter  knowledge  or 
achievement  (e.g.,  Tan,  Mourgues,  Bolden,  &  Grigorenko,  2014). 

Taken  together,  the  measures  of  creativity  and  academic 
achievement  typically  used  in  studies  that  have  examined  their 
relationship  tend  to  include  both  subjective  and  more  objective 
types  of  measurement.  Moreover,  creativity  measures  tend  to  focus 
more  on  assessing  divergent  thinking  skills  and  abilities  (e.g., 
generating  original  ideas),  whereas  academic  tests  tend  to  focus 
more  on  whether  students  can  meet  predetermined  task  expecta¬ 
tions  (e.g.,  accurately  solving  a  problem  in  mathematics).  Figure  2 
provides  a  visual  representation  of  where  creativity  and  academic 
achievement  tests  tend  to  place  their  emphasis. 

As  depicted  in  Figure  2,  these  areas  of  emphasis  map  onto  the 
conceptual definition of creativity (C = O × TC), with creativity
tests  tending  to  focus  on  the  originality  (O)  aspect  of  creativity  and 
measures  of  academic  achievement  tending  to  focus  on  meeting 
predetermined  task  constraints  (TC).  The  area  of  empirical  overlap 
between  these  measures  is  therefore  restricted  to  the  narrow  inter¬ 
section  between  O  and  TC.  We  therefore  might  expect  that  the 
empirical relationship between creativity and academic achievement
is constrained by the types of measures used to assess these
constructs. It is unclear, however, how the various types of measurement
used in previous studies affect the average relationship
between creativity and academic achievement. We therefore endeavor
to shed light on this moderating factor.

Figure 2. A visual representation of where creativity and academic
achievement tests place emphasis.

Education  Stage 

Education  stage  is  another  potentially  moderating  factor  in  the 
relationship  between  creativity  and  academic  achievement.  Some 
of  the  earliest  creativity  research  in  classrooms  was  conducted  by 
Torrance  (1968),  who  documented  what  he  called  a  fourth-grade 
slump  (i.e.,  declines  in  creativity  in  the  transition  from  third  to 
fourth  grade).  Since  that  time,  researchers  have  demonstrated 
variability  in  creativity  scores  across  stages  of  education.  The 
relationship  between  the  imagination  of  children  starting  school 
and  their  achievements,  for  instance,  hardly  exists  at  all,  r  =  .02 
(Karwowski  &  Dziedziewicz,  2012).  However,  as  early  as  the  fifth 
grade,  that  relationship  has  been  found  to  be  more  substantial,  r  = 
.23  (Jankowska,  Gajda,  &  Karwowski,  2015;  Karwowski,  2015). 
In  yet  other  studies,  the  relationship  between  creativity  and 
achievement  in  elementary  school  students  has  been  found  to 
range between r = .08 (Gajda, 2008) and r = .39 (Awamleh, Al
Farah,  &  El-Zraigat,  2012).  Variations  have  also  been  found  in 
middle grades, r = .18 (Rindermann & Neubauer, 2004), and high
school,  ranging  from  r  =  .12  (Kim  &  Michael,  1995)  to  r  =  .21 
(Karwowski,  2005). 

Although  education  stage  seems  to  moderate  the  relationship 
between  creativity  and  achievement,  there  is  no  clear  pattern  or 
direction  that  can  be  expected  from  previous  findings.  As  such,  the 
present  study  aims  to  provide  a  more  stable  estimate  of  the  influ¬ 
ence  of  grade  level  on  the  relationship  between  creativity  and 
academic  achievement. 

Decade 

As  with  education  stage,  there  is  prior  empirical  work  suggest¬ 
ing  that  decade  may  influence  the  relationship  between  creativity 
and  academic  achievement.  Kim  (2011)  conducted  one  of  the 
largest cross-sectional studies (N = 279,599) that examined the
pattern of creativity scores over six time periods (from 1966 to
2008). Kim summarized her findings by stating that creativity
scores, measured by the TTCT, are “declining over time among
Americans of all ages, especially kindergarten through third grade,
the decline is steady and persistent, from 1990 to present, and
ranges across various components tested by the TTCT” (p. 293).
When  taking  the  full  range  of  decades  into  account,  the  patterns 
demonstrate  more  variability  (including  periods  of  gain,  stagna¬ 
tion,  and  decline).  Moreover,  the  changes  from  one  sampled  time 
period  to  the  next  are  often  (but  not  always)  statistically  signifi¬ 
cant,  and  the  magnitude  of  the  effect  varies  from  small  to  large 
(depending  on  the  particular  component  of  the  TTCT  examined 
and  the  time  period  tested). 

Consequently,  we  expect  that  time  period  likely  will  have  some 
influence  on  the  relationship  between  creativity  and  academic 
achievement,  but  it  is  difficult  to  predict  the  direction  or  magnitude 
of  that  difference.  Our  analysis  will,  however,  allow  us  to  examine 
whether  studies  conducted  across  different  time  periods  moderate 
the  relationship  between  creativity  and  academic  achievement. 

Culture 

Finally,  we  expect  culture  to  play  a  moderating  role  in  the 
relationship  between  creativity  and  academic  achievement.  The 
direction  and  magnitude  of  that  difference,  however,  are  once 
again  difficult  to  predict.  Researchers  have  noted  that  conceptual¬ 
izations  of  creativity  can  and  do  differ  across  cultures  (Kaufman  & 
Sternberg,  2006;  Rudowicz,  2003).  Given  that  there  is  so  much 
variation  within  cultures  (Freund  &  Holling,  2008;  Gralewski  & 
Karwowski,  2012),  it  is  difficult  to  untangle  the  within  variation 
from  the  between  variation  in  previous  work.  As  such,  the  present 
study  aims  to  help  clarify  whether  and  to  what  extent  culture 
moderates  the  relationship  between  creativity  and  academic 
achievement. 

Method 

Search  Strategies 

We  followed  a  three-step  procedure  to  select  the  studies  in¬ 
cluded  in  our  meta-analysis.  The  first  step  was  a  review  of  articles 
and  research  papers  in  English.  We  searched  EBSCO,  PsycExtra, 
Academic Search Complete, PsycInfo, PsycArticles, and ERIC
databases  and  used  the  resources  of  JSTOR,  Science  Direct,  SAGE 
Journals,  Taylor  &  Francis,  and  ProQuest.  In  the  next  step,  we 
analyzed  book  publications  using  three  electronic  libraries:  Wiley 
Online  Library  and  Questia,  as  well  as  Google  Books. 

We  used  the  following  search  parameters  to  collect  articles 
(keywords, abstracts, titles, and full text): academic achievement*
or  school  grades*  or  school  achievement*  or  scholastic  achieve¬ 
ment*  or  grade  point  average  and  creative  ability*  or  creativity*  or 
divergent  thinking*.  Finally,  in  the  third  step  of  our  search  proce¬ 
dure,  we  explored  whether  any  additional  studies  could  be  found 
by  conducting  a  review  of  Polish-language  periodicals  devoted  to 
psychology  and  education.  We  chose  Polish-language  periodicals 
because  the  first  two  authors  had  access  to  this  literature  and  are 
fluent  in  the  language. 

Inclusion  and  Exclusion  Criteria 

Our  search  yielded  a  total  of  148  studies.  We  then  applied  the 
following selection criteria to those studies. First, we only considered
studies that presented a quantitative measure of the strength of
the  relationship  between  creativity  and  academic  achievement, 
even  if  the  relationship  between  creativity  and  academic  achieve¬ 
ment  was  not  the  primary  goal  of  the  study.  A  total  of  18  studies2 
did  not  meet  this  first  selection  criterion  and  were  eliminated  from 
the  analysis. 

Next,  we  only  included  studies  if  they  used  more  objective 
measures  of  creativity  (e.g.,  TTCT)  or  self-report  scales  that  dem¬ 
onstrated  adequate  reliability,  such  as  measures  of  creative  per¬ 
sonality  (e.g.,  Naderi  et  al.,  2009)  or  creative  self-confidence 
beliefs  (e.g.,  Skager  et  al.,  1967).  This  resulted  in  the  elimination 
of  four  studies.3  With  respect  to  academic  achievement,  we  in¬ 
cluded  studies  that  used  GPA  (e.g.,  Chamorro-Premuzic,  2006), 
external  examinations  (e.g.,  Tan  et  al.,  2014),  and  achievement 
tests  created  by  researchers  for  the  purpose  of  their  study  (e.g., 
Dobrotowicz,  2002;  Sethi,  2012).  This  resulted  in  the  elimination 
of  one  study  that  used  students’  self-assessments  of  academic 
achievement  (Kaltsounis,  1974). 

We  also  excluded  two  studies  that  used  data  presented  in  other 
publications,  one  study  that  used  data  previously  published  by  a 
different  author,  and  two  studies  that  used  multilevel  models.  The 
two  studies  that  used  multilevel  models  were  excluded  because 
they  provided  unstandardized  regression  coefficients  that  were 
inflated  by  the  control  of  nesting  students  into  classes  and  schools. 
Although β values are sometimes translated into r values (Peterson
&  Brown,  2005),  there  is  no  widely  accepted  or  robust  procedure 
for  translating  coefficients  from  multilevel  models  into  standard¬ 
ized  effect  size  for  use  in  meta-analysis. 

A  total  of  120  of  the  original  148  studies  met  our  selection 
criteria  and  were  included  in  the  analysis.  Taken  together,  the 
included  studies  had  782  effects  with  over  50,000  participants 
(N  =  52,578).  Participants  had  a  mean  age  of  13.8  years  (SD  = 
2.43)  and  attended  elementary,  middle,  and  high  schools  as  well  as 
colleges  or  universities.  The  studies  were  conducted  between  1962 
and  2015,  in  various  countries  (including  the  United  States,  Euro¬ 
pean  countries,  Asia,  and  Africa).  Table  1  provides  a  detailed 
overview  of  the  studies  included  in  the  meta-analysis. 
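
The screening arithmetic described above can be checked with a short tally. This is only an illustrative sketch: the counts are taken from the text, while the variable names and the wording of each exclusion reason are ours.

```python
# Tally of the screening steps described above; counts are taken from the text.
initial_studies = 148

# (reason, number of studies removed), in the order the criteria were applied
exclusions = [
    ("no quantitative creativity-achievement effect reported", 18),
    ("creativity measure not objective or not sufficiently reliable", 4),
    ("achievement based on students' self-assessment", 1),
    ("data already presented in other publications", 2),
    ("data previously published by a different author", 1),
    ("multilevel models with coefficients that could not be converted", 2),
]

remaining = initial_studies
for reason, removed in exclusions:
    remaining -= removed
    print(f"{reason}: -{removed} -> {remaining} studies")

total_removed = sum(n for _, n in exclusions)
```

The running total ends at the 120 studies reported as meeting the selection criteria.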

Coding  Procedures 

The  first  two  authors  independently  coded  each  article  for 
relevant  information,  including  sample  size,  sample  selection,  ef¬ 
fect  size,  and  information  necessary  for  the  moderator  analyses 
(i.e.,  measures  of  creativity  and  academic  achievement,  partici¬ 
pants’  age  and  stage  of  education,  date  and  location  of  publica¬ 
tion).  Next,  we  reviewed  the  coded  data  and  articles,  as  well  as 
discussed  and  resolved  any  discrepancies  to  help  eliminate  errors 
in  coding. 

Moderators 

For  each  study  included  in  our  analysis,  we  coded  for  the  key 
moderators  of  interest.  With  respect  to  type  of  measurement,  we 
coded  the  type  of  creativity  measure  used  in  the  study  (i.e.,  creative 
ability  test  or  self-report  questionnaire).  We  distinguished  between 
different  types  of  creativity  tests,  including  tests  based  on  Guil¬ 
ford’s  theory,  the  TCT-DP  by  Urban  and  Jellen,  the  TTCT,  and 
other  instruments  (e.g.,  Remote  Associates  Test;  Mednick,  1963). 
We also coded the different dimensions of creative ability measured
by tests used in the studies (i.e., overall indices of creative
ability,  fluency,  flexibility,  originality  of  thinking,  and  elabora¬ 
tion).  With  respect  to  academic  achievement,  we  coded  for  how 
achievement  was  measured  (i.e.,  GPA  or  achievement  test)  and 
type  of  achievement  measured  (i.e.,  humanities,  science,  overall 
performance,  sports). 

Finally,  we  coded  (a)  education  stage  (i.e.,  elementary,  middle 
school,  high  school,  college/university),  (b)  study  year  (i.e.,  the 
year the study was conducted), and (c) location (i.e., the country or
continent  where  the  study  was  conducted).  We  also  included  two 
dichotomously  coded  control  variables  that  might  influence  the 
relationship  between  creativity  and  academic  achievement.  Those 
control  variables  included  (a)  goal  of  the  study  (i.e.,  primary 
purpose  was  examining  the  relationship  between  creativity  and 
academic  achievement  vs.  another  goal)  and  (b)  publication  status 
(i.e.,  published  or  unpublished  study). 

Statistical  Methods 

When  possible,  we  computed  effect  size  using  the  values  of 
correlation  coefficients  (r)  and  sample  size  ( N ).  In  a  few  studies, 
however, we converted the effect value provided (e.g., β, F, or χ²)
to the value of the r correlation coefficient. To analyze main
effects, we used multilevel meta-analysis (Cheung, 2014, 2015;
Konstantopoulos,  2011;  Lebuda,  Zabelina,  &  Karwowski,  2015), 
because  individual  correlations  were  clustered  within  studies.  We 
carried  out  a  three-level  meta-analysis.  Level  1  related  to  the 
participants  in  individual  studies,  Level  2  to  interdependent  effects 
within  independent  studies,  and  Level  3  to  the  studies  themselves. 
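
The conversions mentioned above can be sketched as follows. The text does not state which formulas were applied, so the functions below are an assumption: they are the common textbook conversions for an F statistic with one numerator degree of freedom and a chi-square statistic with one degree of freedom, together with the usual large-sample approximation for the sampling variance of r. All function names are hypothetical.

```python
import math


def r_from_f(f_value, df_error):
    """Correlation magnitude from an F statistic with one numerator df."""
    return math.sqrt(f_value / (f_value + df_error))


def r_from_chi2(chi2_value, n):
    """Correlation magnitude (phi) from a chi-square statistic with one df."""
    return math.sqrt(chi2_value / n)


def sampling_variance_r(r, n):
    """Large-sample approximation of the sampling variance of r."""
    return (1 - r ** 2) ** 2 / (n - 1)


# Hypothetical inputs, used only to show the calls.
r_f = r_from_f(4.0, 96)
r_chi = r_from_chi2(4.0, 100)
v_r = sampling_variance_r(r_f, 100)
```

Both conversions return only the magnitude of r; the sign has to be taken from the direction of the effect reported in the original study.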

Three-level  meta-analysis  made  it  possible  to  obtain  robust 
estimates  of  effect  size,  specifically  unbiased  estimates  of  standard 
errors,  Level  2  (within-study)  variance,  and  Level  3  (between- 
study)  variance.  Three-level  meta-analysis  was  required  because 
averaging  the  effects  of  individual  studies  would  have  significantly 
weakened  the  power  of  the  entire  analysis  (we  had  782  effects,  but 
these were drawn from 120 studies) and would not have allowed us
to  estimate  the  influence  of  various  moderators  (as  these  were 
attributed  to  specific  effects  rather  than  studies). 


2  Those  18  studies  focused  on  analyzing  the  theory  of  positive  disinte¬ 
gration  (Gallagher,  1985);  the  effectiveness  of  training  influences 
(Blumen-Pardo,  2002;  Cheung,  Roskams,  &  Fisher,  2006;  Malekian  & 
Fathi,  2012;  Yorke-Viney,  2007);  the  analysis  of  success  in  teaching 
(Hodder,  1972);  the  analysis  of  the  relationship  of  sibling  structure  with 
creativity,  intelligence,  and  academic  achievement  (Cicirelli,  1967);  inves¬ 
tigating  the  predictors  of  entrepreneurship  (Farzaneh  et  al.,  2010);  seeking 
various  predictors  of  academic  achievement  (Childs,  1978;  Muhich,  1972; 
Owen,  Feldhusen,  &  Thurston,  1970;  Richards  &  Casey,  1975;  Yamamoto, 
1964);  analyzing  the  relationship  between  parenting  style,  perfectionism, 
and  creativity  in  talented  individuals  with  high  academic  achievement 
(Miller,  Lambert,  &  Speirs  Neumeister,  2012);  the  teacher’s  perception  of 
creativity,  intelligence,  and  academic  achievement  (Mayfield,  1979);  the 
measurement  of  creativity,  intelligence,  and  academic  achievement  (Eisen- 
man,  Platt,  &  Darbes,  1968);  analyzing  the  reliability  and  validity  of 
ideational  originality  (Runco  &  Albert,  1985);  and  creativity  in  exact 
sciences  (Son,  2009). 

3  Those  four  studies  included  a  single  self-report  question  from  a  ques¬ 
tionnaire  as  a  measure  of  creativity  (Unal  &  Demir,  2009),  judges’  rating 
of  participants  with  low-reliability  products  (Hasirci  &  Demirkan,  2007; 
Priest,  2006),  and  a  questionnaire  completed  by  the  teacher  concerning 
students’  creativity  level  (Baltzer,  1988). 


Table  1 

The  Subjects  of  the  Studies  Included  in  the  Meta-Analysis 
Note. GPA = grade point average; TTCT = Torrance Test of Creative Thinking; SAT = Stanford Achievement Tests; ACT = American College Testing; CAB 5 = Comprehensive Ability Battery;
TCT-DP  =  Test  of  Creative  Thinking-Drawing  Production;  CBEST  =  California  Basic  Educational  Skills  Test. 
The multilevel meta-analysis was conducted using the metaSEM
package (Cheung, 2014, 2015) in the R statistical environment
(R Development Core Team, 2013). When analyzing the
effect of publication bias, we also used the Comprehensive
Meta-Analysis package (Biostat, 2008), the metafor package in R
(Viechtbauer,  2010),  and  p-curve  (Simonsohn,  Nelson,  &  Sim¬ 
mons,  2014). 


Results 

We  present  the  results  of  the  meta-analysis  in  three  steps.  First, 
we  present  a  general  estimation  of  the  effect  size  obtained  in  the 
multilevel  model  and  in  the  random-effects  model.  Next,  we 
analyze  the  potential  influence  of  publication  bias,  which  helps 
determine  the  robustness  of  the  obtained  effect  size.  Finally,  in 
further  multilevel  models,  we  present  the  results  of  our  moderator 
analyses. 
Figure 3. A funnel plot assessing the possible publication bias.


Overall  Effect 

Table  2  presents  the  overall  effect  of  the  relationship  between 
creativity  and  academic  achievement.  The  obtained  mean  effect 
size  was  consistent  with  our  expectations.  More  specifically,  there 
was a positive and statistically significant, albeit modest, relationship:
r = .22, 95% CI [.19, .24].4 As expected, this effect was also
heterogeneous, Q(df = 781) = 9,481.65, p < .001. Both within-
study variance (between particular effects) and between-study
variance were statistically significant, with most of the variance
being between (I² = .62) rather than within studies (I² = .30).5
Prior  to  examining  the  influence  of  moderators,  however,  we 
examined  to  what  extent  the  obtained  effect  may  be  influenced  by 
publication  bias. 
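
As a rough consistency check on the heterogeneity figures above, the total share of variability not attributable to sampling error can be approximated from Q and its degrees of freedom. The formula below is the standard Q-based definition of I²; whether the level-specific values in the text were derived in exactly this way is not stated, so this is a sketch under that assumption.

```python
def i_squared(q, df):
    """Q-based I^2: share of total variability beyond sampling error."""
    return max(0.0, (q - df) / q)


# Values reported for the three-level model: Q(781) = 9,481.65.
total_i2 = i_squared(9481.65, 781)
print(round(total_i2, 3))
```

The result is close to the sum of the reported between-study (.62) and within-study (.30) components.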

Publication  Bias 

We  analyzed  the  robustness  of  the  obtained  effect  size  by 
examining  whether  it  was  influenced  by  publication  bias.  We  used 
a  four-step  process  that  included  both  classic  and  more  recent 
methods  of  analysis.  First,  we  used  a  funnel  plot  (Duval  & 
Tweedie,  2000)  with  several  nonparametric  techniques  to  estimate 
possible  bias.  We  next  used  a  p-curve  analysis  (Simonsohn  et  al., 
2014) and then estimated the effect using PET-PEESE6 (Stanley
&  Doucouliagos,  2014).  Finally,  we  compared  effect  sizes  ob¬ 
tained  in  published  versus  unpublished  studies. 

An  inspection  of  the  funnel  plot  (see  Figure  3)  does  not  suggest 
asymmetry  (i.e.,  correlations  on  one  side  of  the  funnel  do  not  seem 

Table 2

Overall Effect Size Obtained Using Three-Level Meta-Analysis

                                              95% CI
Effects                    Estimate   SE     LL     UL      p

Fixed effect
  Overall effect             .215    .015   .187   .244   <.001
Random effects
  Within-study variance      .010    .001   .008   .011   <.001
  Between-study variance     .020    .003   .013   .026   <.001

Note. Number of studies = 120, number of effects = 782, total N =
52,578. CI = confidence interval; LL = lower limit; UL = upper limit.


to  be  regularly  suppressed  by  the  effects  on  the  other  side).  This 
pattern  suggests  a  lack  of  publication  bias  (although  such  an 
interpretation  is  based  more  on  a  qualitative  judgment,  rather  than 
strict  statistical  rules). 

To  assist  with  the  interpretation  of  the  funnel  plot,  research¬ 
ers  conducting  meta-analyses  often  include  statistical  analysis. 
We used Egger's regression intercept test (Egger, Davey Smith,
Schneider, & Minder, 1997), based on the random-effects
model, to assess funnel plot asymmetry, as well as Begg and
Mazumdar's (1994) rank correlation test. Both tests were
nonsignificant (ps = .42 and .30, respectively), so we concluded
there was no evidence of publication bias.

We  next  performed  a p-curve  analysis7  (Simonsohn  et  al.,  2014) 
to  examine  the  credibility  of  the  estimate  using  the  online  appli¬ 
cation  available  at  http://www.p-curve.com/.  The  results  of  the 
p-curve  analysis  (see  Figure  4)  provided  no  evidence  of  a  “file- 


4 A robustness check performed using Comprehensive Meta-Analysis
software (Biostat, 2008) on averaged effects for studies revealed the
existence of an identical relationship. Due to high heterogeneity (Q =
892.61, df = 119, p < .001, I² = 86.67%), we performed analyses using
the random-effects model, in which we obtained a mean correlation of r =
.22, 95% CI [.19, .24], and a high degree of heterogeneity, τ² = .015, τ =
.12.

5  The  relatively  low  within-study  variance  compared  to  between-study 
variance  suggests  that  an  equally  good  analytic  choice  could  have  been 
meta-analysis  using  the  random-effects  method  on  data  aggregated  to  the 
level  of  individual  studies.  However,  we  chose  multilevel  analysis  per¬ 
formed  at  the  level  of  individual  correlations  (with  correlation  grouping  in 
studies  controlled  for),  because  some  of  the  possible  moderators  clearly 
had  a  within-study  character  (e.g.,  the  operationalization  of  creative  abil¬ 
ities  as  the  fluency,  flexibility,  and  originality  of  thinking). 

6 This method fits a meta-regression model predicting effect sizes in
studies by their standard errors (the precision effect test, called PET) or
their variances (the precision effect estimate with standard errors, called
PEESE). If the intercept is statistically significant in the PET model, the
intercept of the PEESE model should be taken as the publication bias-free
effect size.

7  The  p-curve  analysis  focuses  only  on  statistically  significant  studies 
(i.e.,  all  effects  below  significance  level  are  excluded)  and  checks  whether 
“just  significant  effects”  (i.e.,  slightly  lower  than  p  =  .05  or  between  p  = 
.04  and  p  =  .05)  are  not  overrepresented  in  the  analyzed  studies.  Such 
overrepresentation  may  be  caused  not  only  by  publication  bias  but  also  by 

cherry-picking, “p-hacking,” or other questionable research practices
(Simonsohn  et  al.,  2014). 
Figure 4. The p-curve analysis of publication bias. (A) The p-curve analysis and (B) its robustness. See the
online  article  for  the  color  version  of  this  figure. 


drawer  effect”  (i.e.,  most  studies  provided  highly  significant  re¬ 
sults)  and  there  was  also  no  overrepresentation  of  “just  significant 
effects”  (i.e.,  slightly  lower  than  p  =  .05  or  between  p  =  .04  and 
p  =  .05;  Figure  4A).  Even  more  important,  p-curve  analysis 
demonstrated  that  the  obtained  effects  were  quite  robust  and  in¬ 
sensitive  to  the  exclusion  of  subsequent  studies  with  the  highest  p 
values  (Figure  4B). 

Moreover,  the  continuous  test  for  a  right-skewed  curve  (i.e., 
examining whether studies contain evidential value) was statistically
significant (z = -30.78, p < .0001), whereas testing for
left-skewed  studies  (i.e.,  those  that  exhibit  evidence  of  p-hacking) 
did  not  yield  significant  results  (p  >  .999).  Taken  together,  the 
results  of  our  p-curve  analysis  provided  further  evidence  that  there 
was  not  an  influence  of  publication  bias. 

The  third  step  in  the  analysis  of  publication  bias  was  the 
creation  of  a  model  based  on  the  PET-PEESE  method.  Because 
the  intercept  obtained  using  PET  was  statistically  significant 
(r  =  .219,  SE  =  .016,  p  <  .001),  we  adopted  the  intercept 
obtained  based  on  PEESE  as  a  measure  of  effect  size  not 
affected  by  publication  bias,  as  recommended  by  Stanley  and 
Doucouliagos  (2014).  The  obtained  effect  was  nearly  the  same 
as the results reported before (i.e., r = .215, 95% CI [.192,
.239],  p  <  .001),  which  also  suggests  no  evidence  of  publica¬ 
tion  bias. 
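
The logic of the PET-PEESE step can be illustrated with a minimal sketch. Following Stanley and Doucouliagos (2014), PET regresses effect sizes on their standard errors and PEESE regresses them on their squared standard errors, both weighted by inverse variance; the intercept is read as the effect expected for a perfectly precise study. The sketch uses made-up data, ignores the clustering of effects within studies, and omits the significance test of the PET intercept. All names are hypothetical.

```python
def wls_line(x, y, weights):
    """Weighted least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(weights)
    mx = sum(w * xi for w, xi in zip(weights, x)) / sw
    my = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxy = sum(w * (xi - mx) * (yi - my) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - mx) ** 2 for w, xi in zip(weights, x))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def pet_intercept(effects, std_errors):
    """PET: regress effects on standard errors, inverse-variance weights."""
    weights = [1 / se ** 2 for se in std_errors]
    return wls_line(std_errors, effects, weights)[0]


def peese_intercept(effects, std_errors):
    """PEESE: regress effects on squared standard errors, same weights."""
    weights = [1 / se ** 2 for se in std_errors]
    variances = [se ** 2 for se in std_errors]
    return wls_line(variances, effects, weights)[0]


# Made-up data in which small studies report larger effects.
ses = [0.05, 0.10, 0.15, 0.20]
effects_linear = [0.2 + 0.5 * se for se in ses]
effects_quadratic = [0.2 + 1.0 * se ** 2 for se in ses]

pet_b0 = pet_intercept(effects_linear, ses)
peese_b0 = peese_intercept(effects_quadratic, ses)
```

In both constructed cases the intercept recovers the underlying value of .20, which is the quantity interpreted as the bias-corrected effect.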

Finally, the results of comparing the effects obtained in published,
r = .23, 95% CI [.20, .27], versus unpublished studies, r =
.19, 95% CI [.15, .22], revealed a marginal difference in favor of
published studies (Q = 3.86, df = 1, p = .05), but the similar size
of  the  estimated  effects  and  the  overlapping  confidence  intervals 
make  it  legitimate  to  conclude  that  publication  bias  did  not  sub¬ 
stantively  influence  our  estimations. 

Moderator  Analysis 

We  analyzed  the  role  of  moderators  in  a  sequence  of  multilevel 
regression models and used the measures of the baseline model's
fit (-2LL = -794.92, df = 3) obtained in our analysis of the
overall  effect  to  compare  models  with  moderators  included  as 
predictors.  This  approach  allowed  us  to  control  for  the  mutual 
associations  between  predictors.  At  the  end  of  this  section,  we  also 


present  estimations  obtained  using  the  analysis  of  variance 
(ANOVA)  analog  (Wilson,  2014),  performed  at  the  study  level. 
Although  not  as  statistically  robust  as  multilevel  models,  the 
ANOVA  analog  analysis  provides  estimated  effects  for  different 
groups  of  studies  in  a  more  convenient  and  easier  to  interpret 
fashion. 

Types  of  measurement  and  study  year.  In  the  first  step,  we 
entered  three  moderators  representing  the  measurement  of  creativ¬ 
ity  (0  =  self-report,  1  =  test),  academic  achievement  (0  =  test,  1 
=  GPA),  and  study  year  (grand  centered).  We  also  included  two 
control variables in Step 1: research objective (0 = other, 1 =
creativity-achievement) and publication status (0 = unpublished,
1 = published).

This model demonstrated better fit to the data (-2LL = -813.61,
df = 8, Δ-2LL = 18.69, Δdf = 5, p = .002) compared to our baseline
model.  The  results  are  presented  in  Table  3. 
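
The model comparison above is a likelihood ratio test: the difference in -2LL between nested models is referred to a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters. The sketch below reproduces that arithmetic with the Python standard library only; the function names are hypothetical, and the chi-square tail probability is obtained from a standard recurrence for the incomplete gamma function rather than from the software used in the study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate for integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


def likelihood_ratio_test(neg2ll_reduced, neg2ll_full, df_reduced, df_full):
    """Return (delta -2LL, delta df, p) for two nested models."""
    delta = neg2ll_reduced - neg2ll_full
    delta_df = df_full - df_reduced
    return delta, delta_df, chi2_sf(delta, delta_df)


# Baseline model versus the Step 1 moderator model, values from the text.
delta, delta_df, p = likelihood_ratio_test(-794.92, -813.61, 3, 8)
print(round(delta, 2), delta_df, round(p, 3))
```

The same function can be applied to the later model comparisons in this section by substituting the corresponding -2LL and df values.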

As  displayed  in  Table  3,  the  predictors  entered  in  the  model 
explained  11%  of  between-study  variance  and  2.2%  of  within- 
study  variance.  The  obtained  effects  were  stronger  when  cre¬ 
ativity  was  measured  using  tests  compared  to  when  it  was 
measured  using  self-report  scales,  as  well  as  stronger  for  aca¬ 
demic  achievement  measured  using  standardized  tests  com¬ 
pared  to  using  GPA.  With  respect  to  study  year,  there  was  no 
significant  influence  on  the  obtained  effect  size,  suggesting  that 
the  correlations  were  stable  across  time  (see  Figure  5).  Finally, 
the  two  control  variables  (i.e.,  research  objective  and  publica¬ 
tion  status)  were  not  significantly  related  to  effect  size. 
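
To make the coefficients in Table 3 easier to read, the model-implied correlation for a given combination of moderator values can be computed by adding the relevant estimates to the intercept. The sketch below copies the fixed-effect estimates from Table 3; the dictionary keys and function name are ours, and predictors that are not passed are held at 0 (i.e., unpublished study, other research goal, mean study year).

```python
# Fixed-effect estimates copied from Table 3.
COEFFICIENTS = {
    "intercept": 0.119,
    "creativity_test": 0.097,             # 0 = self-report, 1 = test
    "gpa": -0.039,                        # 0 = achievement test, 1 = GPA
    "study_year_centered": 0.0002,
    "goal_creativity_achievement": -0.003,
    "published": 0.054,
}


def predicted_effect(**predictors):
    """Model-implied correlation for the given moderator values."""
    value = COEFFICIENTS["intercept"]
    for name, x in predictors.items():
        value += COEFFICIENTS[name] * x
    return value


test_and_exam = predicted_effect(creativity_test=1, gpa=0)
self_report_and_gpa = predicted_effect(creativity_test=0, gpa=1)
print(round(test_and_exam, 3), round(self_report_and_gpa, 3))
```

Under these assumptions, the combination of a creativity test with an achievement test implies a correlation of about .22, whereas a self-report scale combined with GPA implies about .08.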

In  the  second  step,  we  removed  nonsignificant  predictors 
from  the  model  (research  objective,  publication  status,  study 
year)  and  added  variables  specifying  the  location  of  study  (with 
Europe  as  the  reference  value)  and  type  of  achievement  mea¬ 
sured  (i.e.,  performance  in  the  humanities,  in  sciences,  and 
overall  performance,  with  sport  as  the  reference  value).  This 
model  did  not  fit  the  data  significantly  better  than  the  previous 
model (-2LL = -819.37, df = 13, Δ-2LL = 5.76, Δdf = 5,
p  =  .33).  Given  that  these  additional  moderators  did  not  influ¬ 
ence  the  obtained  effects,  our  results  indicate  that  the  relation¬ 
ship  between  creativity  and  achievement  was  stable  regardless 
Table 3

Moderator Analysis: Types of Measurement and Study Year

                                                                          95% CI
Effects                                                Estimate   SE     LL     UL      p

Fixed effects
  Intercept                                              .119    .040   .041   .198    .003
  Creativity measurement (0 = self-report, 1 = test)     .097    .028   .042   .153    .001
  Academic achievement measurement (0 = test, 1 = GPA)  -.039    .018  -.074  -.004    .03
  Study year (grand centered)                            .0002   .001  -.001   .002    .76
  Goal (0 = other, 1 = creativity × achievement)        -.003    .029  -.060   .054    .91
  Published? (0 = no, 1 = yes)                           .054    .030  -.005   .114    .07
Random effects
  Within-study variance                                  .009    .001   .008   .011   <.001
  Between-study variance                                 .018    .003   .012   .023   <.001

Note. Number of studies = 120, number of effects = 782, total N = 52,578. CI = confidence interval; LL =
lower limit; UL = upper limit; GPA = grade point average.


of  location8  where  the  study  was  conducted  and  regardless  of 
domain  of  achievement  examined.  Moreover,  given  that  these 
additional  moderators  were  not  significant,  we  do  not  provide 
detailed  results  of  Step  2  of  the  analysis  (but  interested  readers 
can find those results in the online supplemental material Table
S1).

Education  stage.  The  next  step  took  into  account  the  possi¬ 
bility  of  effects  being  influenced  by  the  participants’  education 
stage.  We  used  a  different  model  for  examining  this  moderator 
because  eight  studies  (and  154  correlations)  used  samples  that 
combined  participants  from  elementary  and  middle,  elementary 
and  high,  or  middle  and  high  schools.  Thus,  we  removed  those 
eight  studies  from  this  step  and  conducted  our  analysis  using  a 
model  that  included  a  total  of  628  effects  from  112  studies  (see 
Table  4). 

The  results  of  multilevel  regression,  using  elementary  school 
students  as  the  reference  category,  indicated  that  the  effect  ob¬ 
served  for  middle  school  students  was  significantly  higher  than  the 
effect for elementary students (B = 0.12, SE = 0.05, p = .015).
The  effect  sizes  obtained  for  high  school  and  university/college 
students  did  not  differ  significantly  from  the  effect  obtained  for 
elementary  school  students. 

Aspects  of  creativity  tests.  Given  that  we  found  consistently 
stronger  associations  between  creativity  and  academic  achieve¬ 
ment  obtained  in  studies  where  creativity  was  measured  using  tests 
compared  to  self-report,  we  conducted  a  more  focused  analysis  on 
studies  that  used  creativity  tests  (i.e.,  106  studies,  700  effects).  The 
overall effect obtained only in those studies was r = .23, SE = .016,
95% CI [.20, .26], with a significant level of heterogeneity, Q(df =
699) = 8,145.81, p < .001, situated mainly between studies, I² = .60,
rather than within them, I² = .31, -2LL = -676.70, df = 3.

Therefore,  in  the  next  model,  in  addition  to  the  method  of 
measuring  academic  achievement,  we  included  four  more  specific 
moderators  in  the  group  of  creativity  test  predictors — namely, 
fluency,  flexibility,  originality  of  thinking,  elaboration,  and  overall 
creative  ability  (e.g.,  the  sum  of  TTCT  or  TCT-DP  scores)  and 
other  measures  (e.g.,  imagination  as  measured  by  Jankowska  & 
Karwowski,  2015).  We  used  a  combination  of  overall  indices  of 
creative  ability  and  other  measures  as  the  reference  category  for 
our  analysis.  This  model  did  not  fit  the  data  better  than  the 
previously tested model (-2LL = -681.35, df = 8; Δ-2LL =
4.65, Δdf = 5; p = .46). Moreover, the various aspects of creative
ability  (fluency,  flexibility,  originality,  elaboration)  did  not  differ 
from  the  reference  category  in  terms  of  the  effect  size  generated 
(see  the  online  supplemental  material  Table  S3). 

Next,  we  examined  whether  the  verbal  or  figural  characteristics 
of  the  creativity  test  resulted  in  different  obtained  effects.  For  this 
analysis,  from  the  total  pool  of  studies  using  creativity  tests  (106 
studies,  700  effects),  we  excluded  16  studies  whose  authors  did  not 
provide  separate  results  for  verbal  and  figural  tests  (e.g.,  Anwar, 
Aness,  Khizar,  Naseer,  &  Muhammad,  2012;  Porter,  1974;  Za¬ 
belina,  Condon,  &  Beeman,  2014).  The  observed  effect  was  there¬ 
fore  estimated  on  a  total  of  90  studies.  The  results  of  this  model  are 
presented  in  Table  5. 

As  depicted  in  Table  5,  the  average  effect  size  estimated  on  617 
correlations  did  not  differ  significantly  from  the  overall  effect 
previously reported: r = .228, SE = .017, 95% CI [.194, .262],
p < .001; Q(df = 616) = 7,520.86, p < .001; I² between studies =
.595; I² within studies = .322 (-2LL = -577.52, df = 3). This
model, which examined test type (0 = figural, 1 = verbal) in
addition to the previously examined moderators and controls, fit
the data better than the previously tested model
(-2LL = -685.00, df = 15; Δ-2LL = 107.48, Δdf = 12; p <
.001).  Moreover,  the  results  of  this  analysis  indicate  that  verbal 
tests  of  creativity  generated  significantly  higher  effects  than  figural 
tests. 


8  An  anonymous  reviewer  questioned  our  decision  to  include  studies 
published  in  languages  other  than  English  (especially  Polish  but  also 
Lithuanian).  Specifically,  the  reviewer  recommended  that  we  exclude  these 
studies  as  they  may  cause  difficulty  for  those  who  want  to  replicate  our 
study.  Ultimately,  we  decided  to  keep  these  non-English  studies  in  our 
analysis  for  three  reasons.  First,  eliminating  them  would  reduce  the  statis¬ 
tical  power  of  our  meta-analysis  to  88  studies.  Second,  our  additional 
analyses  (see  the  online  supplemental  material  Table  S2)  showed  that 
although  studies  published  in  Polish  and  Lithuanian  yielded  significantly 
lower effect size, r = .14, 95% CI [.10, .18], than studies published in
English, r = .24, 95% CI [.21, .27], this effect was caused by the fact that
nonverbal  tests  were  more  often  used  in  Poland  and  Lithuania,  not  by  the 
country  itself.  When  we  controlled  for  the  type  of  the  test,  the  effect  of 
country  was  no  longer  significant  (p  =  .44).  Hence,  we  decided  to  analyze 
all  obtained  effects.  Finally,  we  are  making  available  the  raw  data  and  R 
scripts  to  researchers  interested  in  replicating  our  analyses,  available  here: 
https://osf.io/zhr8v/. 
Figure 5. The relations between study year and effect size.


Finally,  in  an  effort  to  provide  a  summary  of  estimated  effects 
of  the  moderators,  we  conducted  a  meta-analysis  analog  of 
ANOVA9  using  the  estimations  obtained  at  the  study  level.  Results 
of  that  analysis  are  presented  in  Table  6.  As  with  our  previously 
reported  findings,  the  results  of  this  analysis  indicate  that  the 
observed  effect  was  stable  across  time  (similar  in  concurrent 
decades)  but  moderated  by  the  type  of  creative  test  used.  More 
specifically,  verbal  tests  developed  in  the  Guilford  tradition  (e.g., 
unusual  uses  or  consequences  tasks;  see  Guilford,  1967)  resulted  in 
more  than  two  times  higher  correlations  (r  =  .30)  with  academic 
achievement  than  did  figural  (e.g.,  TCT-DP,  see  Urban,  1991)  tests 
(r  =  .14).  Moreover,  the  use  of  standardized  academic  achieve¬ 
ment  tests  resulted  in  higher  correlations  with  creativity  (r  =  .28) 
compared to the use of GPA (r = .19). In addition, the academic
stage  of  middle  school  (r  =  .33)  resulted  in  higher  correlations 
between  creativity  and  academic  achievement  compared  to  ele¬ 
mentary  schools  (r  =  .23),  high  schools  (r  =  .21),  and  universities 
(r  =  -17). 
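To make the logic of the subgroup comparison concrete, the following Python sketch computes a between-groups Q statistic for the verbal and nonverbal rows of Table 6. It is a simplified approximation and not the analysis reported here: the function name `q_between` is hypothetical, and the subgroup standard errors are back-calculated from the rounded 95% confidence intervals on the assumption that those intervals are symmetric on the Fisher z scale. The result should therefore be read as being in the neighborhood of the reported Q rather than as a replication of it.

```python
import math

def q_between(groups):
    """Between-groups Q for subgroup mean correlations (illustrative only).

    groups: list of (r, ci_lower, ci_upper) tuples, one per subgroup.
    Assumption: each 95% CI is symmetric on the Fisher z scale, so the
    standard error is (z_upper - z_lower) / (2 * 1.96).
    """
    z_means, weights = [], []
    for r, lower, upper in groups:
        se = (math.atanh(upper) - math.atanh(lower)) / (2 * 1.96)
        z_means.append(math.atanh(r))   # Fisher z transform of the subgroup mean
        weights.append(1.0 / se ** 2)   # inverse-variance weight
    grand = sum(w * z for w, z in zip(weights, z_means)) / sum(weights)
    q = sum(w * (z - grand) ** 2 for w, z in zip(weights, z_means))
    return q, math.tanh(grand)          # back-transform the weighted grand mean to r

# Rounded values taken from Table 6 (verbal vs. nonverbal creativity tests).
subgroups = [(0.30, 0.25, 0.34), (0.14, 0.10, 0.18)]
q_value, grand_r = q_between(subgroups)
print(f"Q_between = {q_value:.2f}, weighted grand mean r = {grand_r:.2f}")
```

A large Q relative to its degrees of freedom indicates that the subgroup means differ by more than sampling error alone would suggest, which is the sense in which test type is said to moderate the effect.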

Finally, the results of our ANOVA analog analysis also indicated
significant differences in the strength of the relationship between
creativity and academic achievement across continents (Q = 32.58, df = 5,
p < .001). This finding, however, appears to be an artifact caused by the
lack of control for differences in the characteristics of studies.
Indeed, as our previous analysis indicates (see the online supplemental
material Table S1), when properly controlling for between-country
differences in creativity and academic achievement measurement, continent
does not significantly influence the obtained effect size.

Discussion 

The goal of this meta-analysis was to clarify the somewhat mixed findings
of previous research that has examined the relationship between
creativity and academic achievement. More specifically, we endeavored to
obtain a stable estimate of the direction and magnitude of the
relationship between creativity and academic achievement. In addition, we
had the aim of examining the influence of potential moderators on this
observed relationship.


What  Is  the  Relationship  Between  Creativity  and 
Academic  Achievement? 

With respect to the relationship between creativity and academic
achievement, our results indicate that there is a modest but significant
positive association (r = .22) in the studies we analyzed. Moreover, our
analyses indicate that this relationship was not influenced by
publication bias. These findings align with longstanding assertions of
scholars who have described creativity and learning as representing
interrelated phenomena (e.g., Beghetto, 2016a; Guilford, 1967; Piaget,
1962, 1981; Sawyer, 2012; Vygotsky, 1967/2004). The modest magnitude of
this relationship (r = .22), however, raises questions as to why the
observed association was so low. Indeed, this relationship explains only
about 5% of the variance shared by creativity and academic achievement.
With so much unaccounted-for variance, it is important to consider what
factors might be influencing this relationship. The results of our
moderator analysis help shed some light on this issue. In the sections
that follow, we discuss the results of our moderator analysis and
conclude with a brief discussion of strengths, limitations, and future
directions for this line of research.
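The step from a correlation to a proportion of explained variance is simply squaring the coefficient. The short Python sketch below works through that arithmetic for the pooled estimate and, for comparison, for two of the moderator-level estimates reported in this article. The helper name `variance_explained` is hypothetical and is used only for illustration.

```python
def variance_explained(r: float) -> float:
    """Proportion of variance shared by two variables: the squared correlation."""
    return r ** 2

# Correlations reported in the text; the labels are descriptive only.
reported = {
    "pooled estimate": 0.22,
    "verbal creativity tests": 0.30,
    "self-report scales": 0.12,
}

shares = {label: variance_explained(r) for label, r in reported.items()}
for label, share in shares.items():
    print(f"{label}: r = {reported[label]:.2f}, r^2 = {share:.4f}")
```

For the pooled estimate this gives r^2 = .0484, which rounds to the 5% figure cited above.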

What  Is  the  Influence  of  Different  Types  of  Measures? 

Conceptually speaking, one of the clearest factors that can influence the
observed relationship between creativity and academic achievement is how
the constructs are measured. Our results indicate that the relationship
between creativity and academic achievement was significantly stronger
when creativity was measured with tests (r = .23, 95% CI [.20, .26]),
particularly verbal tests (r = .30, 95% CI [.25, .34]), compared to when
it was measured using self-report scales (r = .12, 95% CI [.07, .17]).
That test-based measures would have a stronger influence on the
relationship between creativity and academic achievement is not
surprising. Indeed, as we noted earlier, cognitive characteristics
relevant to creative ability, such as the fluency, flexibility, and
originality of thinking (Guilford, 1967); imagination (Jankowska &
Karwowski, 2015); induction and deduction abilities (Weisberg, 2006); and
the use of specific problem-solving strategies play a considerable role in
the learning process (Chamot, Dale, O’Malley, & Spanos, 1992;
Hmelo-Silver, 2004). As such, our results provide further evidence of the
potentially positive role that creativity can play in the acquisition,
consolidation, and processing of new knowledge, including school
knowledge (Hennessey & Amabile, 1987).

9 Although this analytic technique does not control for the associations
and shared variance between moderators (and is therefore less robust than
previously reported multilevel regression models), it provides results (i.e.,
effects in terms of averaged correlations), which tend to be easier for
readers to interpret.

Table 4

Moderator Analysis: Education Stage

Effects                                                 Estimate  SE    95% CI LL  95% CI UL  p
Fixed effects
  Intercept                                             .17       .04   .08        .25        <.001
  Creativity measurement (0 = self-report, 1 = test)    .08       .03   .02        .14        .01
  Academic achievement measurement (0 = test, 1 = GPA)  -.03      .02   -.07       .004       .08
  Education stage (elementary = reference category)
    Middle school                                       .12       .05   .02        .21        .015
    High school                                         .004      .04   -.08       .08        .93
    College/university                                  -.04      .04   -.11       .04        .36
Random effects
  Within-study variance                                 .01       .001  .008       .01        <.001
  Between-study variance                                .02       .003  .011       .022       <.001

Note. Estimated on 112 studies and 628 correlations. CI = confidence interval; LL = lower limit; UL = upper
limit; GPA = grade point average.

We also found that the obtained effect size differed depending on the
type of academic achievement measure used. More specifically, when the
criterion of achievement was GPA, the effect was significantly weaker,
r = .19, 95% CI [.16, .22], compared to when achievement was measured
using standardized achievement tests, r = .28, 95% CI [.22, .34]. This
difference may be caused by various factors. It may reflect the lower
reliability of school grades compared to standardized achievement tests
(Elliott & Strenta, 1988). In a majority of the meta-analyzed studies
(especially the early ones), data concerning the reliability of grades
were not given, and therefore we were unable to estimate the corrected
correlations.
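For readers unfamiliar with what a reliability correction involves, the sketch below applies the classical (Spearman) correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. This is offered as one common form of such a correction, not necessarily the one that would have been applied here. The reliability values are purely hypothetical, because the source studies did not report them, and the function name `disattenuate` is an illustrative assumption.

```python
import math

def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: r / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(rel_x * rel_y)

r_gpa = 0.19                  # observed creativity-GPA correlation from the text
assumed_rel_creativity = 0.80 # hypothetical reliability, NOT taken from the studies
assumed_rel_gpa = 0.70        # hypothetical reliability, NOT taken from the studies

r_corrected = disattenuate(r_gpa, assumed_rel_creativity, assumed_rel_gpa)
print(f"observed r = {r_gpa:.2f}, corrected r = {r_corrected:.2f}")
```

Under these assumed reliabilities the corrected value is larger than the observed one, which illustrates why lower reliability of grades alone could account for part of the gap between the GPA and test-based estimates.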

It  is  also  possible,  however,  that  this  difference  has  substantive 
meaning. One reason the correlation between creativity and
grades was lower than the correlation between creativity and more
objective academic achievement tests (Organisation for Economic
Co-operation and Development [OECD], 2014) may be that the
willingness  to  express  one’s  creativity  can  be  influenced  by  subtle 
environmental  features  of  the  classroom  (Amabile,  1996;  Beghetto 


Table 5

Moderator Analysis: Figural vs. Verbal Creativity Tests

Effects                                                 Estimate  SE    95% CI LL  95% CI UL  p
Fixed effects
  Intercept                                             -.040     .131  -.296      .216       .76
  Year (grand centered)                                 .001      .001  -.001      .003       .36
  Goal (0 = other, 1 = creativity × achievement)        .018      .033  -.048      .083       .59
  Published? (0 = no, 1 = yes)                          .004      .035  -.064      .073       .90
  Academic achievement measurement (0 = test, 1 = GPA)  -.020     .020  -.060      .019       .31
  Test type (figural = 0, verbal = 1)                   .170      .017  .136       .203       <.001
  School subjects (sport = reference category)
    Humanistic                                          .200      .124  -.043      .443       .11
    Science                                             .189      .124  -.055      .432       .13
    Overall                                             .164      .127  -.084      .413       .20
  Creative abilities (other + general = reference)
    Fluency                                             -.038     .023  -.083      .007       .10
    Flexibility                                         -.024     .025  -.073      .025       .35
    Originality                                         -.029     .024  -.076      .018       .23
    Elaboration                                         .018      .030  -.041      .077       .56
Random effects
  Within-study variance                                 .009      .001  .007       .011       <.001
  Between-study variance                                .017      .003  .011       .023       <.001

Note. Estimated on 90 studies and 617 correlations. CI = confidence interval; LL = lower limit; UL = upper
limit; GPA = grade point average.

Table 6

Meta-Analysis Analog of ANOVA: Summary of Moderators

Moderator               k     N         r             95% CI        Heterogeneity (Q)^a
Decade (Q = 8.83, df = 5, p = .12)
  1960-1969             19    5,378     .25***        [.18, .32]    126.91***
  1970-1979             8     17,198    [illegible]   [.09, .26]    46.37***
  1980-1989             9     1,121     .20**         [.09, .31]    29.56***
  1990-1999             7     3,024     .15***        [.11, .19]    6.25
  2000-2009             35    10,239    [illegible]   [.15, .26]    273.57***
  2010-2015             42    21,711    [illegible]   [.18, .27]    377.42***
Region (Q = 32.58, df = 5, p < .001)
  Africa                2     539       .03           [-.06, .11]   .39
  South America         1     141       .16           [-.01, .32]   NA
  North America         61    30,299    .22***        [.18, .26]    427.24***
  Australia             1     855       .32***        [.26, .38]    NA
  Asia                  14    3,852     [illegible]   [.16, .38]    155.63***
  Europe                41    22,985    .20***        [.16, .24]    272.55***
Type of creative ability mode (Q = 26.94, df = 2, p < .001)
  Verbal                42    18,929    .30***        [.25, .34]    438.16***
  Nonverbal             28    10,451    .14***        [.10, .18]    62.93***
Creativity test (Q = 10.44, df = 3, p = .02)
  Guilford              30    11,125    .26***        [.21, .31]    205.19***
  TCT-DP                15    3,929     .18***        [.14, .21]    13.78
  TTCT                  22    3,746     .20***        [.15, .25]    54.66***
  Other                 25    16,306    .27***        [.21, .34]    394.70***
Academic achievement measure (Q = 6.27, df = 1, p = .01)
  GPA                   73    35,341    .19           [.16, .22]    412.66***
  Achievement tests     31    11,328    .28***        [.22, .34]    322.04***
Education stage (Q = 16.44, df = 3, p = .001)
  Elementary            26    10,906    .23***        [.17, .29]    204.21***
  Middle school         15    8,511     .33***        [.27, .39]    204.73***
  High school           28    21,559    .21***        [.16, .26]    148.82***
  College/university    42    11,602    .17           [.12, .22]    287.67***

Note. A meta-analysis analog of analysis of variance (ANOVA) was used with a study as a unit of analysis. k = the number of studies included in the
analysis; N = sample size; CI = confidence interval; NA = Not Applicable; TCT-DP = Test of Creative Thinking-Drawing Production; TTCT = Torrance
Test of Creative Thinking; GPA = grade point average.
^a df for the Q statistic is the number of studies (k) - 1.
***p < .001.


& Kaufman, 2014; Hennessey, 2010). Teachers who, for instance,
prioritize students’ ability to meet predetermined task expectations
(over originality) when assessing students’ work send subtle messages
to students that originality is not necessary or perhaps not
wanted (Beghetto, 2013). Consequently, students may learn that it
is not worth the risk or effort to try to be creative in their responses.
It is also possible that teachers may downgrade more original or
unexpected responses. Indeed, there is evidence that teachers
sometimes hold negative views about student behaviors associated
with creativity (Gralewski & Karwowski, 2013, 2016; Karwowski,
2007, 2010; Scott, 1999; Westby & Dawson, 1995). Regardless of
the reason, it is important to note that the observed relationship
was still positive (albeit somewhat modest).

Taken together, these findings help illustrate the importance of the
types of measures used to assess creativity and academic achievement.
Indeed, given the theoretical links between creativity and learning, one
might expect a stronger correlation than what we found. With respect to
creativity, the most popular measures tend to focus on divergent thinking
(i.e., the ability to produce original ideas) and less on convergent
thinking (i.e., the ability to meet task constraints; see Barbot,
Besançon, & Lubart, 2015, for an exception). As such, commonly used
creativity tests often fail to represent broader conceptions of
creativity (Baer, 2014; Cropley, 2006), which include a combination of
originality and task constraints (Beghetto, Kaufman, & Baer, 2015;
Plucker et al., 2004; Simonton, 2012). Consequently, such measures may be
too narrow in what they capture. The same can be said for
self-assessments of creativity.

Indeed, it may be the case that self-assessments also suffer from a form
of “originality bias” (Beghetto, 2010; Runco & Acar, 2010) wherein they
emphasize the more divergent aspects of creativity at the expense of the
more convergent aspects of creativity. Given that academic measures tend
to focus more on convergence (i.e., meeting task constraints, providing
expected results), the use of overly narrow measures of creativity may
result in systematically suppressed estimates of the observed
relationship between creativity and academic achievement. At this point,
such assertions are somewhat speculative and therefore warrant attention
in future studies. As such, future research should focus on developing
and testing measures of creativity that more adequately represent the
creative combination of divergent and convergent thinking (see Barbot et
al., 2015; Lubart & Besançon, in press). Doing so may help clarify
whether there is a stronger empirical relationship between creativity and
academic achievement than what is otherwise represented in more
traditional measures.

With respect to academic achievement, future studies should also use more
precise measures of academic achievement. In the case of GPA, it is
frequently effort (Brookhart, 1997), progress (Nitko, 2001), or even the
student’s adjustment to the teacher’s demands (Wortham, 2004) that is
evaluated. Moreover, given that creative students sometimes approach
learning tasks in unexpected and unorthodox ways (Beghetto, 2013, 2016a;
Günçer & Oral, 1993; Karwowski & Jankowska, in press), their GPAs may be
negatively influenced by failing to meet behavioral expectations (rather
than reflecting academic ability). Measures of academic achievement that
more clearly focus on learning gains (rather than meeting teachers’
expectations for obtaining those gains) might provide a more accurate
assessment of student learning and thereby more accurately reflect the
relationship between student creativity and academic achievement.

What  Is  the  Effect  of  Education  Stage? 

Our results indicate that the relationship between creativity and
academic achievement is similar across most education stages, with the
exception of middle school (r = .33). Why might this be the case? Classic
(Torrance, 1968) and more contemporary (Krampen, 2012) analyses suggest
that although there may be declines in creativity development in
childhood, there seems to be rather systematic growth in creative ability
from puberty onward (Claxton, Pannells, & Rhoads, 2005; Milgram & Hong,
1999). Even though there is some evidence of higher levels of creativity
in elementary school students compared to middle school students (Yi, Hu,
Plucker, & McWilliams, 2013), middle school students may, on average,
experience a boost in creative ability. This assertion has a basis in
developmental theory (Feldman, 2003) and in neuropsychology (Barbot &
Tinio, 2014). The middle school years are, for instance, thought to be a
key developmental period for thinking skills, which are then measured in
student skills assessment programs such as the Programme for
International Student Assessment (OECD, 2014). Although studies have
demonstrated an increase in thinking skills starting in elementary school
(Molnar, Greiff, & Csapo, 2013), the most pronounced development of these
skills tends to occur during the middle school years (Csapo, 1997). This
is not to say that middle school years are free from declines or creative
suppression (Beghetto & Dilley, 2016), but prior work suggests that these
years of development may serve as a key time of growth in creative
abilities (Barbot, Lubart, & Besançon, 2016; Kleibeuker, De Dreu, &
Crone, 2013). Such assertions, however, warrant further empirical
exploration.

Our findings also indicate higher correlations in the middle school stage
of education compared to high school and universities. This finding has
less theoretical and empirical support than the observed difference
between elementary and middle school. One possible explanation is that
learning becomes increasingly specialized at higher levels of education.
The majority of the studies included in our meta-analysis used general
rather than discipline-specific measures of creative potential, which
tend to have lower levels of predictive validity when explaining more
specialized academic achievement (see Baer, 2014, in press). The fact
that we did not observe differences in the strength of the relationship
between various dimensions of school functioning may also be an
indication that domain-general measures of creative ability (which tended
to be operationalized as a form of divergent thinking, i.e., fluency,
flexibility, elaboration, and originality of thinking) were not sensitive
enough to provide differential estimations of academic achievement across
disciplines.

Once again, these findings point to the importance of the sensitivity and
scope of the measures used to assess creativity and academic achievement.
Indeed, both creativity and learning researchers tend to be in agreement
that creativity and learning are domain specific (Alexander, 1995; Baer,
2014, in press; Beghetto et al., 2015; Poitras & Lajoie, 2013). Future
research should therefore use domain-specific measures to examine whether
such measures influence the observed relationship between creativity and
learning and whether there are potentially important differences across
domains.

What  Is  the  Influence  of  Time  and  Place? 

Finally, we examined the potential influence of time (i.e., when the
study was conducted) and place (i.e., the country or continent in which
the study was conducted). Our findings indicate that the relationship
between creativity and academic achievement was stable across time and
place. This finding differs from the results of previous research, which
has suggested that creativity may be declining over time (Kim, 2011) and
that creativity is often conceptualized and experienced differently
across cultures (Kaufman & Sternberg, 2006).

When interpreting these findings, it is important to point out that the
analyses conducted here and in related studies (e.g., Kim, 2005) are
cross-sectional. Without longitudinal data, it is difficult (if not
impossible) to make any definitive claims about the relationship between
creativity and academic achievement across time. Moreover, the studies we
analyzed did not have the goal of providing direct comparisons across
cultures, and as such, cultural differences that may influence creativity
and academic achievement may not have been adequately assessed or
represented in the studies we analyzed. Consequently, strong claims about
the influence of time and culture are not appropriate until additional
research is conducted that focuses specifically on addressing the impact
of time (measured longitudinally) and the impact of culture (using more
direct cross-cultural comparisons). Our findings, however, do provide a
starting point for researchers to examine whether and under what
conditions the positive relationship between creativity and academic
achievement is stable across time and place.

Strengths and Limitations of the Present Study

Strengths

A  strong  point  of  our  meta-analysis  is  that  it  serves  as  the  first 
study  to  provide  a  stable  estimate  of  the  relationship  between 
creativity  and  academic  achievement.  Consequently,  this  study 
contributes  much-needed  clarification  on  this  relationship.  Another 
key  strength  is  the  scope  of  the  study.  More  specifically,  our 
results  cover  a  wide  range  of  temporal  (1962-2015),  territorial 
(studies  from  all  over  the  world),  and  numerical  (120  independent 
studies,  782  effects,  and  the  total  joint  sample  exceeding  52,000 
participants)  factors.  In  fact,  this  study  represents  one  of  the  largest 
meta-analyses  in  the  creativity  literature  to  date.  We  also  consider 
the  analytic  models  applied  (multilevel  meta-analysis)  to  be  an 
advantage. Indeed, multilevel models enabled us to provide more
robust  estimations  of  the  observed  effects  and  the  effects  of  key 
moderators. 

Limitations 

A  disadvantage  of  this  meta-analysis  was  the  limited  number  of 
moderators  we  were  able  to  include.  There  are  several  moderating 
factors  (e.g.,  instructional  approach,  curriculum  used,  contextual 
influences  of  schools  and  classrooms,  and  measures  of  various 
individual  differences,  such  as  student  and  teacher  beliefs)  that 
may have shed additional light on factors that influence the
relationship between creativity and academic achievement. Additional
studies  are  therefore  needed  that  take  into  account  these  additional 
individual  and  sociocultural  factors. 

The  unavailability  of  relevant  data  at  the  level  of  individual 
studies  was  also  a  limitation  (e.g.,  the  reliability  of  academic 
achievement  measures).  The  lack  of  these  data  prevented  us  from 
being  able  to  make  corrections  to  the  obtained  effects.  Future 
researchers  (and  journal  reviewers)  are  therefore  well  advised  to 
report  (and  require  the  reporting  of)  relevant  psychometric  data  on 
all  measures  so  that  such  corrections  can  be  made. 

Perhaps the most severe limitation of this synthesis was our inability to
properly control for a number of mediators and confounding variables at
the level of individual studies. This is a limitation that plagues
meta-analytic studies more generally. One way to help address this issue
is for researchers to ensure that their studies include as many
theoretically important predictors of academic achievement as possible.
In the case of creativity, this would include factors such as
intelligence and personality (Chamorro-Premuzic & Furnham, 2008; Day,
Hanson, Maltby, Proctor, & Wood, 2010), thinking styles (Zhang, 2004,
2010, 2012; Zhang & Sternberg, 2005), motivational factors (Bandura,
1997; Hill & Amabile, 1993; Karwowski, 2011, 2012, 2014; Kaufman &
Beghetto, 2013), and contextual factors (Beghetto & Kaufman, 2014;
Schacter, Thum, & Zifkin, 2006).

As already mentioned, longitudinal studies, using more precise measures,
are particularly needed. Longitudinal studies, although costly in terms
of time and resources, would pay off by providing needed insights into
how creativity and academic achievement grow and develop over time. Such
studies would also enable researchers to empirically test various
proposed theoretical links between creativity and academic achievement
(Beghetto, 2016a), including whether the relationship is best thought of
as unidirectional (e.g., creativity → academic achievement; academic
achievement → creativity) or reciprocal (e.g., creativity ↔ academic
achievement).

A final limitation we feel important to highlight pertains to the
possibility of a nonlinear relationship between creativity and academic
achievement. Such a relationship cannot be fully captured in the types of
data (correlation coefficients) and analyses used in this study. A
nonlinear pattern should therefore not be ruled out. Indeed, there is
evidence that such patterns exist between creativity and related
constructs, such as creativity and intelligence (see Jauk, Benedek,
Dunst, & Neubauer, 2013; Karwowski & Gralewski, 2013; see also
Footnote 10). Consequently, subsequent work should explore possible
nonlinear patterns in the relationship between creativity and academic
achievement using analytic techniques such as segmented regression (Jauk
et al., 2013) or a “necessary condition analysis” (Dul, 2016).
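As a rough illustration of what a segmented (piecewise linear) regression looks for, the Python sketch below fits a model with a single breakpoint by grid search over candidate knots and ordinary least squares at each candidate. It is a minimal sketch under stated assumptions and not the procedure used by Jauk et al. (2013): the data are synthetic and noise-free, the variable names and the functions `solve_linear` and `fit_segmented` are hypothetical, and a real analysis would also need to quantify uncertainty in the estimated breakpoint.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]         # partial pivoting
        for i in range(n):
            if i != col:
                factor = m[i][col] / m[col][col]
                m[i] = [v - factor * p for v, p in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_segmented(x, y, candidate_knots):
    """Grid-search one knot for y = b0 + b1*x + b2*max(0, x - knot)."""
    best = None
    for knot in candidate_knots:
        rows = [[1.0, xi, max(0.0, xi - knot)] for xi in x]
        # Normal equations (X'X) beta = X'y for the three-column design.
        xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
        beta = solve_linear(xtx, xty)
        sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                  for r, yi in zip(rows, y))
        if best is None or sse < best[0]:
            best = (sse, knot, beta)
    return best

# Synthetic scores: achievement rises with creativity up to 10, then levels off.
creativity = [float(v) for v in range(21)]
achievement = [2.0 + 0.5 * min(v, 10.0) for v in creativity]

sse, knot, (b0, b1, b2) = fit_segmented(creativity, achievement, range(2, 19))
slope_below, slope_above = b1, b1 + b2  # slopes on either side of the knot
print(f"knot = {knot}, slope below = {slope_below:.2f}, slope above = {slope_above:.2f}")
```

A single correlation coefficient computed over the full range of such data would average the two slopes together, which is why threshold-like patterns cannot be detected from the effect sizes analyzed here.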

Concluding  Thoughts 

For more than six decades, the question of whether creativity and
academic achievement are related has been a focus of theoretical and
empirical work in educational psychology. This question has proven to be
a thorny one, complicated by various types of measures and potentially
intervening factors. Not surprisingly, the results of previous research
have run the gamut from positively related to unrelated and even
negatively related. The upshot of decades’ worth of research on this
question is that it provided numerous effects that we were able to
analyze using robust meta-analytic techniques and thereby take an
important step in the direction of addressing the longstanding question
of whether creativity and academic achievement are related.

Indeed, prior to this study, the question of whether there is a
relationship between creativity and academic achievement could best be
answered with the equivocal response of, “It depends.” Based on the
findings from this meta-analysis, we can now more confidently respond,
“Previous research has, on average, demonstrated a positive (albeit
modest) relationship between creativity and academic achievement, which
is significantly moderated by the types of measures used to assess
creativity and academic achievement.” This, of course, does not mean that
the question is now closed. Rather, the results of the present study
provide researchers with a baseline correlation that they can use in
subsequent research for comparison and further exploration.

The next logical step is to continue to design studies that examine the
stability of this estimate and more carefully examine what additional
factors might influence this relationship. We have already pointed to
several needed directions for future study. One of the most important
future directions pertains to developing and examining the influence of
more precise measures of creativity and academic achievement. Such work,
however, is not purely empirical. Complementary theoretical work is also
needed to help specify how and to what extent creativity and academic
achievement are related phenomena. Educational psychologists can play a
key role in this endeavor by working alongside creativity researchers to
develop more detailed theoretical models that help specify the
relationship between creativity and academic achievement and also help
develop more sensitive measures that can test and further clarify these
asserted relationships. Doing so will provide additional insights into
the longstanding question of how creativity and academic achievement are
related.

10 The nonlinear relationship between creativity and cognitive abilities,
such as intelligence, has been asserted by some of the earliest theorists
(e.g., Guilford, 1967). Some theorists have posited a so-called threshold
hypothesis (see Jauk et al., 2013; Karwowski & Gralewski, 2013; Preckel,
Holling, & Wiese, 2006). This hypothesis asserts a positive relationship
between creativity and intelligence only in the groups of individuals whose
intelligence level is below an IQ of 120, whereas above this threshold, the
correlation is expected to disappear or weaken significantly (Guilford,
1967). Consequently, the threshold hypothesis does not assume a linear
association but rather a curvilinear inverted J-shaped relationship between
intelligence and creativity. Similar thresholds may exist in the relationship
between creativity and academic achievement, such as high levels of
academic achievement suppressing creativity (see Simonton, in press) or,
conversely, high levels of creativity negatively influencing academic
achievement (Kim, 2008). We thank an anonymous reviewer for highlighting
this possibility.